[jira] [Updated] (LUCENE-9173) SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph)

Tomoko Uchida (Jira) Sun, 26 Jan 2020 08:58:18 -0800


     [ https://issues.apache.org/jira/browse/LUCENE-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tomoko Uchida updated LUCENE-9173:
----------------------------------
    Description: 
This is a derived issue from LUCENE-9123.

When the tokenizer given to SynonymGraphFilter decompounds tokens or emits 
multiple tokens at the same position, SynonymGraphFilter cannot handle them 
correctly (an exception is thrown).

For example, JapaneseTokenizer (mode=SEARCH) emits a compound token and two 
decompounded tokens for the text "株式会社":
{code:java}
株式会社 (positionIncrement=0, positionLength=2)
株式 (positionIncrement=1, positionLength=1)
会社 (positionIncrement=1, positionLength=1)
{code}
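The position attributes above describe arcs in a token graph. As a minimal sketch in plain Java (no Lucene dependency), the following converts positionIncrement and positionLength into start and end positions. The Token class and the toArcs helper are hypothetical names used only for illustration. The sketch also assumes that 株式 is emitted first, so that the compound's increment of 0 stacks it on the same position; that emission order is an assumption and is not stated in the listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenGraphArcs {
    /** Hypothetical stand-in for a token's term and position attributes. */
    static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /** Converts increments and lengths into "term: from -> to" arcs. */
    static List<String> toArcs(List<Token> tokens) {
        List<String> arcs = new ArrayList<>();
        int position = -1; // a first increment of 1 lands on position 0
        for (Token t : tokens) {
            position += t.positionIncrement;          // where this token starts
            int end = position + t.positionLength;    // where this token ends
            arcs.add(t.term + ": " + position + " -> " + end);
        }
        return arcs;
    }

    public static void main(String[] args) {
        // Assumed emission order: first part, then the compound at the same position.
        List<Token> tokens = Arrays.asList(
            new Token("株式", 1, 1),
            new Token("株式会社", 0, 2),
            new Token("会社", 1, 1));
        for (String arc : toArcs(tokens)) {
            System.out.println(arc);
        }
    }
}
```

Under that assumed order, two arcs leave position 0 (株式 ending at 1 and 株式会社 ending at 2), which is what makes the graph branched rather than a linear chain.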
If we then supply the synonym rule "株式会社,コーポレーション" through 
SynonymGraphFilterFactory (with tokenizerFactory=JapaneseTokenizerFactory), 
the following exception is thrown:
{code:java}
Caused by: java.lang.IllegalArgumentException: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
        at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
        at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
        at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
        at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:179) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
        at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:154) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
{code}
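The exception message suggests that synonym rule terms are expected to analyze to a linear sequence in which every token advances the position by exactly 1. The following is a simplified model of such a check in plain Java, not the actual SynonymMap.Parser code; the Token class and the requireLinear method are hypothetical names, and only the message text is taken from the trace above.

```java
import java.util.Arrays;
import java.util.List;

public class PositionIncrementCheck {
    /** Hypothetical stand-in for a token's term and position attributes. */
    static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /** Rejects any token whose position increment is not 1 (a modeled check, not Lucene's code). */
    static void requireLinear(String input, List<Token> tokens) {
        for (Token t : tokens) {
            if (t.positionIncrement != 1) {
                throw new IllegalArgumentException("term: " + input
                    + " analyzed to a token (" + t.term
                    + ") with position increment != 1 (got: " + t.positionIncrement + ")");
            }
        }
    }

    public static void main(String[] args) {
        // Tokens as listed in the description for the text "株式会社".
        List<Token> tokens = Arrays.asList(
            new Token("株式会社", 0, 2),
            new Token("株式", 1, 1),
            new Token("会社", 1, 1));
        try {
            requireLinear("株式会社", tokens);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the compound token is rejected because its increment is 0, regardless of where it appears in the stream, so any tokenizer that stacks a compound on top of its parts would fail the same way.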
This isn't limited to JapaneseTokenizer; it is a more general issue with 
handling branched token graphs (decompounded tokens in midstream).

  was:
This is a derived issue from LUCENE-9123.

When the tokenizer that is given to SynonymGraphFilter decompound tokens or 
emit multiple tokens at the same position, SynonymGraphFilter cannot correctly 
handle them (an exception will be thrown).

For example, JapaneseTokenizer (mode=SEARCH) would emit a token and two 
decompounded tokens for the text "株式会社":
{code:java}
株式会社 (positionIncrement=0, positionLength=2)
株式 (positionIncrement=1, positionLength=1)
会社 (positionIncrement=1, positionLength=1)
{code}
Then if we give synonym "株式会社,コーポレーション" by SynonymGraphFilter (set 
tokenizerFactory=JapaneseTokenizerFactory) this exception is thrown.
{code:java}
Caused by: java.lang.IllegalArgumentException: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
        at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
        at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
        at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
        at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:179) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
        at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:154) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
{code}
This isn't only limited to JapaneseTokenizer but a more general issue about 
handling branched token graph (decompounded tokens in the midstream).


> SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph)
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9173
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9173
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Minor



