[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler updated LUCENE-8183:
----------------------------------------
    Description: 
The HyphenationCompoundWordTokenFilter creates overlapping tokens even if 
onlyLongestMatch is enabled. 

Example:

Dictionary: {{gesellschaft}}, {{schaft}}
 Hyphenator: {{de_DR.xml}} //from Apache OFFO
 onlyLongestMatch: true

 
|text|gesellschaft|gesellschaft|schaft|
|raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
|start|0|0|0|
|end|12|12|12|
|positionLength|1|1|1|
|type|word|word|word|
|position|1|1|1|

IMHO this includes 2 unexpected tokens:
 # the 2nd 'gesellschaft', as it duplicates the original token
 # the 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary
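
The expectation behind these two points can be sketched in plain Java. This is only an illustration of the output the report asks for, not Lucene's implementation: the real filter derives candidates from hyphenation points, whereas this sketch uses simple substring matching, and the names {{LongestMatchSketch}} and {{expectedSubwords}} are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the behaviour expected from onlyLongestMatch=true.
// Assumption: candidates are found by substring search, not by hyphenation.
final class LongestMatchSketch {

    /** A dictionary match inside the original token, with character offsets. */
    private static final class Match {
        final String text;
        final int start;
        final int end;

        Match(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the subwords that should be emitted in addition to the original token. */
    static List<String> expectedSubwords(String token, List<String> dictionary) {
        // Collect every occurrence of every dictionary word inside the token.
        List<Match> matches = new ArrayList<>();
        for (String word : dictionary) {
            int idx = token.indexOf(word);
            while (idx >= 0) {
                matches.add(new Match(word, idx, idx + word.length()));
                idx = token.indexOf(word, idx + 1);
            }
        }

        List<String> kept = new ArrayList<>();
        for (Match m : matches) {
            // Point 1: a match spanning the whole token only duplicates the original.
            if (m.text.length() == token.length()) {
                continue;
            }
            // Point 2: drop a match that lies inside a longer dictionary match.
            boolean covered = false;
            for (Match other : matches) {
                if (other != m
                        && other.start <= m.start
                        && other.end >= m.end
                        && other.text.length() > m.text.length()) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                kept.add(m.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Dictionary and input from the example above.
        System.out.println(expectedSubwords("gesellschaft",
                Arrays.asList("gesellschaft", "schaft"))); // prints []
    }
}
```

Under these assumptions, the example input yields no additional subwords, so only the original 'gesellschaft' token would remain in the stream.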

 

  was:
The HyphenationCompoundWordTokenFilter creates overlapping tokens even if 
onlyLongestMatch is enabled. 

Example:

Dictionary: {{gesellschaft}}, {{schaft}}
 Hyphenator: {{de_DR.xml}} //from Apache OFFO
onlyLongestMatch: true

 
|HCWTF|text|raw_bytes|start|end|positionLength|type|position|
| |gesellschaft|[67 65 73 65 6c 6c 73 63 68 61 66 74]|0|12|1|word|1|
| |gesellschaft|[67 65 73 65 6c 6c 73 63 68 61 66 74]|0|12|1|word|1|
| |schaft|[73 63 68 61 66 74]|0|12|1|word|1|

IMHO this includes 2 unexpected tokens:
 # the 2nd 'gesellschaft', as it duplicates the original token
 # the 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary

 


> HyphenationCompoundWordTokenFilter creates overlapping tokens with 
> onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8183
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>         Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory" 
>         hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>          dictionary="lang/wordlist_de.txt" 
>         onlyLongestMatch="true"/>
>  
>            Reporter: Rupert Westenthaler
>            Priority: Major
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if 
> onlyLongestMatch is enabled. 
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
>  Hyphenator: {{de_DR.xml}} //from Apache OFFO
>  onlyLongestMatch: true
>  
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
>  # the 2nd 'gesellschaft', as it duplicates the original token
>  # the 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
