[jira] [Updated] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect

JIRA Thu, 14 Apr 2011 02:38:49 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Johann Höchtl updated LUCENE-3022:
----------------------------------

    Attachment: LUCENE-3022.patch

Patch fixing this issue
including JUnitTest

> DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3022
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3022
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4, 3.1
>            Reporter: Johann Höchtl
>            Priority: Minor
>         Attachments: LUCENE-3022.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When using the DictionaryCompoundWordTokenFilter with a german dictionary, I 
> got a strange behaviour:
> The german word "streifenbluse" (blouse with stripes) was decompounded to 
> "streifen" (stripe),"reifen"(tire) which makes no sense at all.
> I thought the flag onlyLongestMatch would fix this, because "streifen" is 
> longer than "reifen", but it had no effect.
> So I reviewed the sourcecode and found the problem:
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.length() < this.minWordSize) {
>       return;
>     }
>     
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
>     
>     for (int i=0;i<token.length()-this.minSubwordSize;++i) {
>         Token longestMatchToken=null;
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.length()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.length()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             } 
>         }
>         if (this.onlyLongestMatch && longestMatchToken!=null) {
>           tokens.add(longestMatchToken);
>         }
>     }
>   }
> [/code]
> should be changed to 
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.termLength() < this.minWordSize) {
>       return;
>     }
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
>     Token longestMatchToken=null;
>     for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.termLength()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.termLength()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             }
>         }
>     }
>     if (this.onlyLongestMatch && longestMatchToken!=null) {
>         tokens.add(longestMatchToken);
>     }
>   }
> [/code]
> So, that only the longest token is really indexed and the onlyLongestMatch 
> Flag makes sense.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect

Reply via email to