[jira] [Commented] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect

JIRA Wed, 20 Apr 2011 13:28:45 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022351#comment-13022351
 ]


Johann Höchtl commented on LUCENE-3022:
---------------------------------------

Let's stay with this example:
Dict: {"soft","so","ft"}
Word: "softball"

The first option is onlyLongestMatch: when it is set, only the longest 
matching dictionary entry should be returned. (Should this option be modified 
to keep all of the longest matches on a tie, e.g. length(soft) == length(ball)?)
Output: "soft" if true; "so","ft","soft" if false

The second option should be keepRemain, which makes a term out of the 
remainder after subtracting the longest match (this only makes sense together 
with onlyLongestMatch!?)
Output: "soft","ball" if keepRemain==onlyLongestMatch==true

With this second option you could keep the remainders that are not in your 
dictionary (this reduces the complexity of the required dictionary and can 
improve the compound-splitting logic).
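
To make the two options concrete, here is a minimal, self-contained sketch in 
plain Java. It is not the Lucene filter: the class CompoundSketch and the 
method decompose are hypothetical names, minSubwordSize/maxSubwordSize are 
left out, ties are resolved in favour of the first match, and emitting the 
text in front of the match (not only behind it) is an assumption on my side. 
keepRemain is ignored unless onlyLongestMatch is set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Lucene.
final class CompoundSketch {

    static List<String> decompose(String word, Set<String> dict,
                                  boolean onlyLongestMatch, boolean keepRemain) {
        List<String> out = new ArrayList<>();
        int bestStart = -1;
        int bestEnd = -1;
        // Try every substring [start, end) of the word against the dictionary.
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + 1; end <= word.length(); end++) {
                if (!dict.contains(word.substring(start, end))) {
                    continue;
                }
                if (!onlyLongestMatch) {
                    out.add(word.substring(start, end));
                } else if (end - start > bestEnd - bestStart) {
                    // Strictly longer than the best so far: first match wins ties.
                    bestStart = start;
                    bestEnd = end;
                }
            }
        }
        if (onlyLongestMatch && bestStart >= 0) {
            // Assumption: the remainder in front of the match is kept as well.
            if (keepRemain && bestStart > 0) {
                out.add(word.substring(0, bestStart));
            }
            out.add(word.substring(bestStart, bestEnd));
            if (keepRemain && bestEnd < word.length()) {
                out.add(word.substring(bestEnd));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("soft", "so", "ft");
        System.out.println(decompose("softball", dict, false, false)); // [so, soft, ft]
        System.out.println(decompose("softball", dict, true, false));  // [soft]
        System.out.println(decompose("softball", dict, true, true));   // [soft, ball]
    }
}
```

The order of the terms in the first case follows the scan order of this 
sketch and is not meant to say anything about the order the real filter 
would produce.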


> DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3022
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3022
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4, 3.1
>            Reporter: Johann Höchtl
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3022.patch, LUCENE-3022.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When using the DictionaryCompoundWordTokenFilter with a German dictionary, I 
> got strange behaviour:
> The German word "streifenbluse" (blouse with stripes) was decompounded to 
> "streifen" (stripe), "reifen" (tire), which makes no sense at all.
> I thought the flag onlyLongestMatch would fix this, because "streifen" is 
> longer than "reifen", but it had no effect.
> So I reviewed the source code and found the problem:
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.length() < this.minWordSize) {
>       return;
>     }
>     
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
>     
>     for (int i=0;i<token.length()-this.minSubwordSize;++i) {
>         Token longestMatchToken=null;
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.length()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.length()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             } 
>         }
>         if (this.onlyLongestMatch && longestMatchToken!=null) {
>           tokens.add(longestMatchToken);
>         }
>     }
>   }
> [/code]
> should be changed to 
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.termLength() < this.minWordSize) {
>       return;
>     }
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
>     Token longestMatchToken=null;
>     for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.termLength()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.termLength()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             }
>         }
>     }
>     if (this.onlyLongestMatch && longestMatchToken!=null) {
>         tokens.add(longestMatchToken);
>     }
>   }
> [/code]
> That way only the longest token is actually indexed and the onlyLongestMatch 
> flag has an effect.
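
The difference between the two quoted versions comes down to where 
longestMatchToken is declared: inside the outer loop it yields one longest 
match per start offset, outside it yields one longest match for the whole 
word. The following plain-Java sketch shows only that scoping difference. 
It does not use the Lucene classes; LongestMatchScope, longestAt, 
perPosition and wholeWord are hypothetical names, the dictionary is assumed 
to contain just "streifen" and "reifen", and the subword size limits are 
omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Lucene.
final class LongestMatchScope {

    /** Longest dictionary entry starting exactly at start, or null if none. */
    static String longestAt(String word, int start, Set<String> dict) {
        String best = null;
        for (int end = start + 1; end <= word.length(); end++) {
            String candidate = word.substring(start, end);
            if (dict.contains(candidate)) {
                best = candidate; // a later end always means a longer match
            }
        }
        return best;
    }

    /** Like the first quoted version: one longest match per start offset. */
    static List<String> perPosition(String word, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            String match = longestAt(word, start, dict);
            if (match != null) {
                out.add(match);
            }
        }
        return out;
    }

    /** Like the proposed version: one longest match for the whole word. */
    static List<String> wholeWord(String word, Set<String> dict) {
        String best = null;
        for (String match : perPosition(word, dict)) {
            if (best == null || match.length() > best.length()) {
                best = match; // first match wins ties
            }
        }
        return best == null ? List.of() : List.of(best);
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("streifen", "reifen"); // assumed dictionary
        System.out.println(perPosition("streifenbluse", dict)); // [streifen, reifen]
        System.out.println(wholeWord("streifenbluse", dict));   // [streifen]
    }
}
```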

