[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374886#comment-16374886 ]
Matthias Krueger commented on LUCENE-8183:
------------------------------------------

[~rwesten] Quick question regarding your patch: What is the reasoning behind not decomposing terms that are part of the dictionary at all?

The {{onlyLongestMatch}} flag currently affects whether all matches or only the longest match should be returned *per* *start* character (in DictionaryCompoundWordTokenFilter) or *per* hyphenation *start* point (in HyphenationCompoundWordTokenFilter).

Example: the dictionary {{"Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft"}} applied to the input "Wirtschaftswissenschaft" will return the original input plus the tokens "Wirtschaft", "schaft", "wissenschaft", "schaft", but not "Wirt" or "Wissen". "schaft" is still returned (even twice) because it is the longest token starting at the respective position.

I like the idea of restricting this further to only the longest terms that *touch* a certain hyphenation point. This would exclude "schaft" in the example above (as "Wirtschaft" and "wissenschaft" are two longer terms encompassing the respective hyphenation points). On the other hand, there might be examples where you still want to include the "overlapping" tokens: for "Fußballpumpe" and the dictionary {{"Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball"}} you would get the tokens "Fußball" and "pumpe" but not "Ballpumpe", because "Ball" has already been considered part of "Fußball". Also, I am not sure whether your change improves the situation for languages other than German.

Regarding point 1: the current algorithm always returns the term itself again if it is part of the dictionary. I guess this could be changed by checking not against {{this.maxSubwordSize}} but against {{Math.min(this.maxSubwordSize, termAtt.length()-1)}}; see the sketches below.

Perhaps these kinds of adjustments should rather be done in a TokenFilter similar to RemoveDuplicatesTokenFilter instead of complicating the decompounding algorithm?
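To make the {{onlyLongestMatch}} behaviour described above concrete, here is a minimal, self-contained sketch (not part of any patch) that runs the "Wirtschaftswissenschaft" example through DictionaryCompoundWordTokenFilter. It assumes Lucene's analysis-common module is on the classpath; the subword size limits are spelled out explicitly (5/2/15, i.e. the class defaults):

{code:java}
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class OnlyLongestMatchDemo {
  public static void main(String[] args) throws Exception {
    // dictionary from the example above, matched case-insensitively
    CharArraySet dict = new CharArraySet(
        Arrays.asList("schaft", "wirt", "wirtschaft", "wissen", "wissenschaft"), true);

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("Wirtschaftswissenschaft"));

    // minWordSize=5, minSubwordSize=2, maxSubwordSize=15, onlyLongestMatch=true
    TokenStream ts = new DictionaryCompoundWordTokenFilter(tokenizer, dict, 5, 2, 15, true);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      // expected per the description above: Wirtschaftswissenschaft,
      // Wirtschaft, schaft, wissenschaft, schaft
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
{code}

All decompounded tokens are emitted at the same position as the original token (position increment 0) and share its offsets, which is why the tokens in the issue below all report identical start/end values.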
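And for point 1, a rough, paraphrased illustration of the bound change (this is not the actual {{decompose()}} source; {{i}}, {{termAtt}} and {{dictionary}} stand for state the real per-start-position loop already has):

{code:java}
// Capping the subword length at termAtt.length() - 1 keeps the filter from
// re-emitting the whole input term when the term itself is a dictionary entry.
final int maxLen = Math.min(this.maxSubwordSize, termAtt.length() - 1);
for (int j = this.minSubwordSize; j <= maxLen; j++) {
  if (i + j > termAtt.length()) {
    break;                                       // subword would run past the term
  }
  if (dictionary.contains(termAtt.buffer(), i, j)) {
    // onlyLongestMatch bookkeeping / CompoundToken creation as before
  }
}
{code}

The alternative would be a small post-processing step: RemoveDuplicatesTokenFilter already drops same-text tokens at the same position, so chaining it after the decompounder would at least remove the duplicated original term without touching the decompounding algorithm itself.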
> HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
> --------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8183
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>         Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
>         hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>         dictionary="lang/wordlist_de.txt"
>         onlyLongestMatch="true"/>
>            Reporter: Rupert Westenthaler
>            Assignee: Uwe Schindler
>            Priority: Major
>         Attachments: LUCENE-8183_20180223_rwesten.diff
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled.
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
> Hyphenator: {{de_DR.xml}} // from Apache Offo
> onlyLongestMatch: true
>
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
>
> IMHO this includes 2 unexpected tokens:
> # the 2nd 'gesellschaft', as it duplicates the original token
> # the 'schaft', as it is a sub-token of 'gesellschaft' that is present in the dictionary
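For reference, the reported behaviour can be reproduced outside Solr along the following lines (a sketch only: the path to the Offo hyphenation grammar is a placeholder, the dictionary is reduced to the two entries from the example, and the input is lower-cased up front in place of the LowerCaseFilterFactory from the reported chain):

{code:java}
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class OverlappingTokensRepro {
  public static void main(String[] args) throws Exception {
    // placeholder path to the Offo hyphenation grammar from the reported configuration
    HyphenationTree hyphenator =
        HyphenationCompoundWordTokenFilter.getHyphenationTree("lang/hyph_de_DR.xml");

    // reduced dictionary from the example, matched case-insensitively
    CharArraySet dict = new CharArraySet(Arrays.asList("gesellschaft", "schaft"), true);

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("gesellschaft"));

    // minWordSize=5, minSubwordSize=2, maxSubwordSize=15, onlyLongestMatch=true
    TokenStream ts = new HyphenationCompoundWordTokenFilter(
        tokenizer, hyphenator, dict, 5, 2, 15, true);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      // reported output: gesellschaft, gesellschaft, schaft (all at the same position and offsets)
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
{code}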