[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376748#comment-16376748 ]
Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------

FYI: I plan to spend some time implementing a version of the DictionaryCompoundWordTokenFilter that adds two options:

* `noSub`: no tokens are added that are completely enclosed by a longer token (`fußballpumpe`: `fußball`, `ballpumpe`)
* `noOverlap`: no overlapping tokens are added (`fußballpumpe`: `fußball`, `pumpe`)

IMO the simplest way is to first emit all tokens and then filter them based on the active options (`onlyLongestMatch`, `noSub`, `noOverlap`).

Regarding the test: providing good test examples is hard, as the current test cases are based on Danish and I do not speak this language. Providing examples in German would be easy, but this would require a German hyphenator, and that file is licensed under the LaTeX Project Public License and therefore cannot be included in the source. Given suitable examples, implementing the actual tests seems rather easy, as they can be written similarly to the existing test cases.

> HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8183
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>         Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
>         hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>         dictionary="lang/wordlist_de.txt"
>         onlyLongestMatch="true"/>
>
>            Reporter: Rupert Westenthaler
>            Assignee: Uwe Schindler
>            Priority: Major
>         Attachments: LUCENE-8183_20180223_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if
> onlyLongestMatch is enabled.
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
> Hyphenator: {{de_DR.xml}} //from Apache Offo
> onlyLongestMatch: true
>
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
>
> IMHO this includes 2 unexpected tokens:
> # the 2nd 'gesellschaft', as it duplicates the original token
> # the 'schaft', as it is a sub-token of 'gesellschaft' that is present in the dictionary
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
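
For illustration only, the "emit all tokens first, then filter" idea from the comment could be sketched roughly as follows. This is not the patch and does not use the Lucene API: the class `CompoundTokenFilterSketch`, the `Span` type and the method names are hypothetical, and candidates are represented only by their offsets within the compound word. The `noOverlap` strategy shown here (greedy, leftmost start first, longest first) is an assumption; the comment does not say how overlaps should be resolved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names exist in Lucene. */
final class CompoundTokenFilterSketch {

    /** A candidate sub-token, identified by its offsets within the compound word. */
    static final class Span {
        final int start; // inclusive
        final int end;   // exclusive

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }
    }

    /** noSub: drop every span that is completely enclosed by a longer span. */
    static List<Span> noSub(List<Span> all) {
        List<Span> kept = new ArrayList<>();
        for (Span s : all) {
            boolean enclosed = false;
            for (Span other : all) {
                if (other.length() > s.length()
                        && other.start <= s.start && s.end <= other.end) {
                    enclosed = true;
                    break;
                }
            }
            if (!enclosed) {
                kept.add(s);
            }
        }
        return kept;
    }

    /**
     * noOverlap (assumed strategy): sort by start offset, longest first on ties,
     * then keep a span only if it starts at or after the end of the last kept span.
     */
    static List<Span> noOverlap(List<Span> all) {
        List<Span> sorted = new ArrayList<>(all);
        sorted.sort((a, b) -> a.start != b.start
                ? Integer.compare(a.start, b.start)
                : Integer.compare(b.length(), a.length()));
        List<Span> kept = new ArrayList<>();
        int lastEnd = 0;
        for (Span s : sorted) {
            if (s.start >= lastEnd) {
                kept.add(s);
                lastEnd = s.end;
            }
        }
        return kept;
    }

    /** Renders the given spans as substrings of the compound word. */
    static List<String> texts(String word, List<Span> spans) {
        List<String> out = new ArrayList<>();
        for (Span s : spans) {
            out.add(word.substring(s.start, s.end));
        }
        return out;
    }
}
```

Assuming a dictionary that yields the candidates `fuß` (0-3), `ball` (3-7), `fußball` (0-7), `pumpe` (7-12) and `ballpumpe` (3-12) for `fußballpumpe`, `noSub` keeps `fußball` and `ballpumpe`, and `noOverlap` keeps `fußball` and `pumpe`, which matches the two examples in the comment. For the case from the issue, the candidates `gesellschaft` (0-12) and `schaft` (6-12) reduce to `gesellschaft` under `noSub`.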