[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

Rupert Westenthaler (JIRA) Mon, 26 Feb 2018 04:09:07 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376748#comment-16376748
 ]


Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------

FYI: I pan to spend some time to implement a version of the 
DictionaryCompoundWordTokenFilter that adds options for

* `noSub`: no tokens are added the are completely enclosed by an longer 
(`fußballpumpe`: `fußball`, `ballpumpe`)
* `noOverlap`: no overlapping tokens (`fußballpumpe`; `fußball`, `pumpe`)

IMO the simplest way is to first emit all tokens and later filter those based 
on the active options (`onlyLongestMatch`, `noSub`, `noOverlap`).


Regarding the test:

Providing good test examples is hard as the current test cases are based on a 
Danish and I do not speak this language
Providing examples in German would be easy, but this would require a German 
hyphenator and the file is licensed under the LaTeX Project Public License and 
can therefore not be included in the source.
Given suitable examples the implementation of the actual test seams to be 
rather easy as they can be implemented similar to the existing test cases

> HyphenationCompoundWordTokenFilter creates overlapping tokens with 
> onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8183
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>         Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory" 
>         hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>          dictionary="lang/wordlist_de.txt" 
>         onlyLongestMatch="true"/>
>  
>            Reporter: Rupert Westenthaler
>            Assignee: Uwe Schindler
>            Priority: Major
>         Attachments: LUCENE-8183_20180223_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if 
> onlyLongestMatch is enabled. 
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
>  Hyphenator: {{de_DR.xml}} //from Apche Offo
>  onlyLongestMatch: true
>  
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 
> 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected Tokens
>  # the 2nd 'gesellschaft' as it duplicates the original token
>  # the 'schaft' as it is a sub-token 'gesellschaft' that is present in the 
> dictionary
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

Reply via email to