[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892671#comment-16892671 ]
Jim Ferenczi commented on LUCENE-8933:
--------------------------------------

The first argument of a user dictionary rule is the original block to detect, and the second argument is the segmentation for that block. For example, the rule "aaa,aa a,," splits the input "aaa" into the two tokens "aa" and "a". When computing the offsets of the split terms in the user dictionary, we assume that the segmentation contains the same characters as the input, minus the whitespace. We don't check that this is the case, so rules with broken offsets are only detected when they match in a token stream. You don't need emojis or surrogate pairs to break this; it is enough to provide a rule where the segmentation, with whitespace removed, is longer than the input:

{code:java}
UserDictionary dict = UserDictionary.open(new StringReader("aaa,aaaa,,"));
JapaneseTokenizer tok = new JapaneseTokenizer(dict, true, Mode.NORMAL);
tok.setReader(new StringReader("aaa"));
tok.reset();
tok.incrementToken();
{code}

I think we just need to validate the input and throw an exception at build time if this assumption is not met (a rough sketch of such a check follows the quoted issue description below).

> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
>             Key: LUCENE-8933
>             URL: https://issues.apache.org/jira/browse/LUCENE-8933
>         Project: Lucene - Core
>      Issue Type: Bug
>        Reporter: Adrien Grand
>        Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing synonyms. It looks like the only reason why this might occur is if the offset of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
> at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
> at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
> at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}
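For illustration, here is a minimal sketch of what such a build-time check could look like. This is not the actual Lucene code; the class name, method name, and error message are hypothetical. It only captures the invariant described above: the segmentation, with whitespace removed, must reproduce the surface form exactly.

{code:java}
// Hypothetical sketch of a build-time validation for user dictionary rules.
// The offsets computed at match time assume that the segmentation, minus
// whitespace, contains exactly the same characters as the surface form.
public class UserDictionaryRuleCheck {

  /** Throws if the segmentation cannot be aligned with the surface form. */
  static void validate(String surface, String segmentation) {
    String expected = surface.replaceAll("\\s+", "");
    String actual = segmentation.replaceAll("\\s+", "");
    if (!expected.equals(actual)) {
      throw new IllegalArgumentException(
          "Illegal user dictionary entry: segmentation [" + segmentation
              + "] does not match surface form [" + surface + "]");
    }
  }

  public static void main(String[] args) {
    validate("aaa", "aa a"); // OK: "aa a" minus whitespace is "aaa"
    validate("aaa", "aaaa"); // throws: segmentation is longer than the input
  }
}
{code}

With a check like this, the broken rule "aaa,aaaa,," from the reproduction above would be rejected when the dictionary is opened, instead of corrupting offsets later when the rule matches in a token stream.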