Matthias Krueger commented on LUCENE-8183:

[~rwesten] Quick question regarding your patch: What's the reasoning behind not 
decomposing terms that are part of the dictionary at all?

The {{onlyLongestMatch}} flag currently affects whether all matches or only the 
longest match should be returned *per* *start* character (in 
DictionaryCompoundWordTokenFilter) or *per* hyphenation *start* point (in 
HyphenationCompoundWordTokenFilter).

With the dictionary {{"Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft"}}, 
the input "Wirtschaftswissenschaft" returns the original input plus the tokens 
"Wirtschaft", "schaft", "wissenschaft", "schaft" but not "Wirt" or "Wissen". 
"schaft" is still returned (even twice) because each time it is the longest 
token starting at its position.
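
To make the per-start-character behaviour concrete, here is a minimal sketch in plain Java. It is not the Lucene source: the class name {{LongestMatchPerStart}}, the method {{decompose}} and the lower-cased {{Set<String>}} dictionary are stand-ins I made up for illustration, and the size limits 2 and 15 are simply passed in as arguments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the dictionary-based decompounding loop;
// NOT the actual Lucene implementation.
class LongestMatchPerStart {

    // dict is assumed to contain lower-cased entries only.
    static List<String> decompose(String term, Set<String> dict,
                                  int minSubwordSize, int maxSubwordSize,
                                  boolean onlyLongestMatch) {
        List<String> tokens = new ArrayList<>();
        String lower = term.toLowerCase(Locale.ROOT);
        int len = term.length();
        for (int start = 0; start + minSubwordSize <= len; start++) {
            String longest = null;
            for (int size = minSubwordSize;
                 size <= maxSubwordSize && start + size <= len; size++) {
                if (!dict.contains(lower.substring(start, start + size))) {
                    continue;
                }
                String match = term.substring(start, start + size);
                if (onlyLongestMatch) {
                    longest = match; // sizes ascend, so the last hit is the longest
                } else {
                    tokens.add(match);
                }
            }
            // The "longest" decision is made per start offset, not per term.
            if (longest != null) {
                tokens.add(longest);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("schaft", "wirt", "wirtschaft", "wissen", "wissenschaft");
        System.out.println(decompose("Wirtschaftswissenschaft", dict, 2, 15, true));
        // prints [Wirtschaft, schaft, wissenschaft, schaft]
    }
}
```

Because the longest-match decision is taken separately for every start offset, "schaft" survives at offsets 4 and 17 even though a longer match already covers it.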

I like the idea of restricting this further to only the longest terms that 
*touch* a certain hyphenation point. This would exclude "schaft" in the example 
above (as "Wirtschaft" and "wissenschaft" are two longer terms encompassing the 
respective hyphenation point). On the other hand, there might be examples where 
you still want to include the "overlapping" tokens. For "Fußballpumpe" and 
dictionary {{"Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball"}} you would get the 
tokens "Fußball" and "pumpe" but not "Ballpumpe", as "Ball" has already been 
considered part of "Fußball". Also, I'm not sure whether your change improves 
the situation for languages other than German.
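
The trade-off can be sketched with a greedy, non-overlapping variant. This is only one possible reading of the idea and is not taken from the attached patch: it works on character offsets instead of hyphenation points, and the names {{GreedyDecompounder}} and {{decompose}} are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical greedy decompounder: emit the longest dictionary match at
// the current offset, then continue after it so matches never overlap.
// An illustration of the trade-off only, NOT the patch under discussion.
class GreedyDecompounder {

    // dict is assumed to contain lower-cased entries only.
    static List<String> decompose(String term, Set<String> dict, int minSize) {
        List<String> tokens = new ArrayList<>();
        String lower = term.toLowerCase(Locale.ROOT);
        int pos = 0;
        while (pos + minSize <= term.length()) {
            int best = 0;
            for (int size = minSize; pos + size <= term.length(); size++) {
                if (dict.contains(lower.substring(pos, pos + size))) {
                    best = size; // sizes ascend, so this ends up as the longest hit
                }
            }
            if (best > 0) {
                tokens.add(term.substring(pos, pos + best));
                pos += best; // skip everything the match already covers
            } else {
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> german = Set.of("schaft", "wirt", "wirtschaft", "wissen", "wissenschaft");
        System.out.println(decompose("Wirtschaftswissenschaft", german, 2));
        // prints [Wirtschaft, wissenschaft]

        Set<String> football = Set.of("ball", "ballpumpe", "pumpe", "fuß", "fußball");
        System.out.println(decompose("Fußballpumpe", football, 2));
        // prints [Fußball, pumpe]
    }
}
```

The first call drops both "schaft" tokens as intended, while the second shows the cost: "Ballpumpe" can no longer be produced because "Ball" was consumed by "Fußball".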

Regarding point 1: The current algorithm always returns the term itself again 
if it's part of the dictionary. I guess this could be changed if we don't 
check against {{this.maxSubwordSize}} but against 
{{Math.min(this.maxSubwordSize, termAtt.length()-1)}}.
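
A small sketch of the effect of that bound, using the dictionary from the issue description. The class {{SubwordBound}} and its method {{longestMatches}} are hypothetical stand-ins for the matching loop, not Lucene code, and the value 15 for {{maxSubwordSize}} is just an assumed setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper showing how capping the subword size at
// term length - 1 keeps the term itself out of the result.
class SubwordBound {

    // Longest dictionary match per start offset, ignoring candidates
    // longer than maxSize characters.
    static List<String> longestMatches(String term, Set<String> dict,
                                       int minSize, int maxSize) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start + minSize <= term.length(); start++) {
            String longest = null;
            for (int size = minSize;
                 size <= maxSize && start + size <= term.length(); size++) {
                String candidate = term.substring(start, start + size);
                if (dict.contains(candidate)) {
                    longest = candidate;
                }
            }
            if (longest != null) {
                tokens.add(longest);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("gesellschaft", "schaft");
        String term = "gesellschaft";
        int maxSubwordSize = 15; // assumed setting

        System.out.println(longestMatches(term, dict, 2, maxSubwordSize));
        // prints [gesellschaft, schaft]

        int capped = Math.min(maxSubwordSize, term.length() - 1);
        System.out.println(longestMatches(term, dict, 2, capped));
        // prints [schaft]
    }
}
```

The cap only removes the duplicate of the full term; "schaft" is still emitted, so the second point of the report would need separate handling.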

Perhaps this kind of adjustment should rather be done in a TokenFilter 
similar to RemoveDuplicatesTokenFilter instead of complicating the 
decompounding algorithm?

> HyphenationCompoundWordTokenFilter creates overlapping tokens with 
> onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>                 Key: LUCENE-8183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8183
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>         Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory" 
>         hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>          dictionary="lang/wordlist_de.txt" 
>         onlyLongestMatch="true"/>
>            Reporter: Rupert Westenthaler
>            Assignee: Uwe Schindler
>            Priority: Major
>         Attachments: LUCENE-8183_20180223_rwesten.diff
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if 
> onlyLongestMatch is enabled. 
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
>  Hyphenator: {{de_DR.xml}} //from Apache OFFO
>  onlyLongestMatch: true
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
>  # the 2nd 'gesellschaft', as it duplicates the original token
>  # the 'schaft', as it is a sub-token of 'gesellschaft', which is present in the 
> dictionary
