[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375147#comment-16375147 ]

Uwe Schindler commented on LUCENE-8183:
---------------------------------------

bq. I am aware of this possibility. In fact I do use the RemoveDuplicatesTokenFilter to remove those tokens. My point was just why they are added in the first place.

I think it's good to not add them in the first place. The change is quite simple, so it can be done here. And it does not really complicate the algorithm, as it is done in one separate place.

> HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8183
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>         Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
>         hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>         dictionary="lang/wordlist_de.txt"
>         onlyLongestMatch="true"/>
>
>            Reporter: Rupert Westenthaler
>            Assignee: Uwe Schindler
>            Priority: Major
>         Attachments: LUCENE-8183_20180223_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled.
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
> Hyphenator: {{de_DR.xml}} //from Apache Offo
> onlyLongestMatch: true
>
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
>
> IMHO this includes two unexpected tokens:
> # the 2nd 'gesellschaft', as it duplicates the original token
> # the 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary
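
To make the expected behaviour from the report concrete, here is a minimal sketch in plain Python. It is not Lucene code and not the attached patch. The function name `extra_subword_tokens` and the `min_len` parameter are hypothetical, and candidate sub-words are found by brute-force substring lookup rather than at hyphenation points, which is a simplifying assumption. It only illustrates the rule asked for above: do not re-emit the original token, and do not emit a dictionary match that lies inside a longer dictionary match.

```python
def extra_subword_tokens(word, dictionary, min_len=2):
    """Return the sub-word tokens to add next to the original token.

    Hypothetical illustration only: candidates are all substrings of
    `word` found in `dictionary`, not hyphenation-based parts.
    """
    n = len(word)

    # Collect every dictionary match as a (start, end) span.
    matches = []
    for start in range(n):
        for end in range(start + min_len, n + 1):
            if word[start:end] in dictionary:
                matches.append((start, end))

    # Visit the longest matches first so shorter ones can be checked
    # against the spans that have already been accepted.
    matches.sort(key=lambda span: (-(span[1] - span[0]), span[0]))

    kept = []
    for start, end in matches:
        # Skip a match fully contained in a longer accepted match.
        if any(ks <= start and end <= ke for ks, ke in kept):
            continue
        kept.append((start, end))

    # The span covering the whole word only duplicates the original
    # token, so it is never emitted as an extra token.
    kept.sort()
    return [word[s:e] for s, e in kept if (s, e) != (0, n)]


if __name__ == "__main__":
    print(extra_subword_tokens("gesellschaft", {"gesellschaft", "schaft"}))
    print(extra_subword_tokens("fussballpumpe", {"fuss", "ball", "fussball", "pumpe"}))
```

With the dictionary from the report, the sketch adds no extra tokens for 'gesellschaft', which matches the two points listed above. Matches that overlap without one containing the other are both kept here; whether the real filter should do the same is outside the scope of this sketch.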