[ https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022351#comment-13022351 ]
Johann Höchtl commented on LUCENE-3022:
---------------------------------------

Let's stay with this example:

Dict: {"soft","so","ft"}
Word: "softball"

The first option is onlyLongestMatch: only the longest matching dictionary entry is returned. (Should this option be modified to keep all of the longest matches when several have the same length, e.g. length(soft) == length(ball)?)
Output: "soft" if true; "so","ft","soft" if false

The second option would be keepRemain, which makes a term out of the remainder left after subtracting the longest match (this probably only makes sense together with onlyLongestMatch).
Output: "soft","ball" if keepRemain==onlyLongestMatch==true

With this second option you could keep the remainders that are not in your dictionary, which reduces the complexity of the required dictionary and can improve the compound-splitting logic.

> DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3022
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3022
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4, 3.1
>            Reporter: Johann Höchtl
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3022.patch, LUCENE-3022.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When using the DictionaryCompoundWordTokenFilter with a German dictionary, I
> got a strange behaviour:
> The German word "streifenbluse" (blouse with stripes) was decompounded to
> "streifen" (stripe), "reifen" (tire), which makes no sense at all.
> I thought the flag onlyLongestMatch would fix this, because "streifen" is
> longer than "reifen", but it had no effect.
> So I reviewed the source code and found the problem:
> [code]
> protected void decomposeInternal(final Token token) {
>   // Only words longer than minWordSize get processed
>   if (token.length() < this.minWordSize) {
>     return;
>   }
>
>   char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
>
>   for (int i=0;i<token.length()-this.minSubwordSize;++i) {
>     Token longestMatchToken=null;
>     for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>       if(i+j>token.length()) {
>         break;
>       }
>       if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>         if (this.onlyLongestMatch) {
>           if (longestMatchToken!=null) {
>             if (longestMatchToken.length()<j) {
>               longestMatchToken=createToken(i,j,token);
>             }
>           } else {
>             longestMatchToken=createToken(i,j,token);
>           }
>         } else {
>           tokens.add(createToken(i,j,token));
>         }
>       }
>     }
>     if (this.onlyLongestMatch && longestMatchToken!=null) {
>       tokens.add(longestMatchToken);
>     }
>   }
> }
> [/code]
> should be changed to
> [code]
> protected void decomposeInternal(final Token token) {
>   // Only words longer than minWordSize get processed
>   if (token.termLength() < this.minWordSize) {
>     return;
>   }
>   char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
>   Token longestMatchToken=null;
>   for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
>     for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>       if(i+j>token.termLength()) {
>         break;
>       }
>       if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>         if (this.onlyLongestMatch) {
>           if (longestMatchToken!=null) {
>             if (longestMatchToken.termLength()<j) {
>               longestMatchToken=createToken(i,j,token);
>             }
>           } else {
>             longestMatchToken=createToken(i,j,token);
>           }
>         } else {
>           tokens.add(createToken(i,j,token));
>         }
>       }
>     }
>   }
>   if (this.onlyLongestMatch && longestMatchToken!=null) {
>     tokens.add(longestMatchToken);
>   }
> }
> [/code]
> That way, only the longest token is actually indexed and the onlyLongestMatch
> flag has an effect.

--
This message is automatically generated by JIRA.
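To make the two options from the comment concrete, here is a minimal standalone sketch in plain Java. It does not use the Lucene API: the class CompoundSketch, the method decompose and the keepRemainder parameter are hypothetical names chosen for illustration, and keepRemainder only models the keepRemain idea proposed above, which is not an existing filter option. The sketch assumes a single longest match is tracked for the whole word, as in the proposed fix, and that keepRemainder is ignored unless onlyLongestMatch is set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
class CompoundSketch {

    /**
     * Returns the dictionary subwords found in the given word.
     * onlyLongestMatch: keep one longest match for the whole word.
     * keepRemainder:    hypothetical option (the "keepRemain" idea); also emit
     *                   the parts of the word around the longest match.
     *                   Ignored unless onlyLongestMatch is true.
     */
    static List<String> decompose(String word, Set<String> dictionary,
                                  int minSubwordSize, int maxSubwordSize,
                                  boolean onlyLongestMatch, boolean keepRemainder) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        int bestStart = -1; // start offset of the longest match seen so far
        int bestLen = 0;    // length of the longest match seen so far

        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            int maxLen = Math.min(maxSubwordSize, lower.length() - start);
            for (int len = minSubwordSize; len <= maxLen; len++) {
                String candidate = lower.substring(start, start + len);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    // Strictly longer wins, so the first match wins on ties.
                    if (len > bestLen) {
                        bestStart = start;
                        bestLen = len;
                    }
                } else {
                    result.add(candidate);
                }
            }
        }

        if (onlyLongestMatch && bestStart >= 0) {
            String before = lower.substring(0, bestStart);
            String after = lower.substring(bestStart + bestLen);
            if (keepRemainder && !before.isEmpty()) {
                result.add(before);
            }
            result.add(lower.substring(bestStart, bestStart + bestLen));
            if (keepRemainder && !after.isEmpty()) {
                result.add(after);
            }
        }
        return result;
    }
}
```

With the dictionary {"soft","so","ft"} and the word "softball", this sketch returns "soft" when onlyLongestMatch is true, and "soft","ball" when keepRemainder is also true. With onlyLongestMatch false it returns the same three entries as in the comment, in scan order ("so","soft","ft"). Keeping only the first match on a length tie is an arbitrary choice here; the open question about keeping all longest matches is not addressed.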