[
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892671#comment-16892671
]
Jim Ferenczi commented on LUCENE-8933:
--------------------------------------
The first argument of a user dictionary rule is the original text to detect and
the second argument is the segmentation for that text. So the rule "aaa,aa a,,"
splits the input "aaa" into two tokens, "aa" and "a". When computing the
offsets of the split terms in the user dictionary we assume that the
segmentation contains the same characters as the input, minus the whitespaces.
We don't check that this is the case, so rules with broken offsets are only
detected when they match in a token stream. You don't need emojis or surrogate
pairs to break this; just provide a rule where the length of the segmentation
is greater than that of the input minus the whitespaces:
{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.ja.dict.UserDictionary;

// The segmentation "aaaa" is longer than the surface form "aaa".
UserDictionary dict = UserDictionary.open(new StringReader("aaa,aaaa,,"));
JapaneseTokenizer tok = new JapaneseTokenizer(dict, true, Mode.NORMAL);
tok.setReader(new StringReader("aaa"));
tok.reset();
tok.incrementToken(); // throws ArrayIndexOutOfBoundsException
{code}
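To see why the offsets end up out of range, here is a plain-Java sketch of the assumption described above (a hypothetical helper, not the Lucene code): each whitespace-separated piece of the segmentation is assigned a slice of the surface form at a running offset, so when the segmentation minus whitespaces is longer than the surface form, the computed end offset runs past the input.

{code:java}
public class SegmentationOffsets {
  /**
   * Maps each whitespace-separated piece of the segmentation to
   * [startOffset, endOffset] within the surface form. No bounds check
   * is performed, mirroring the assumption the dictionary makes.
   */
  static int[][] offsets(String surface, String segmentation) {
    String[] pieces = segmentation.split("\\s+");
    int[][] result = new int[pieces.length][2];
    int pos = 0;
    for (int i = 0; i < pieces.length; i++) {
      result[i][0] = pos;
      pos += pieces[i].length();
      result[i][1] = pos; // may exceed surface.length() for a broken rule
    }
    return result;
  }

  public static void main(String[] args) {
    // Valid rule "aaa,aa a,,": offsets [0,2] and [2,3] inside "aaa".
    int[][] ok = offsets("aaa", "aa a");
    // Broken rule "aaa,aaaa,,": end offset 4 exceeds the 3-char input.
    int[][] broken = offsets("aaa", "aaaa");
    System.out.println(ok[1][1] + " " + broken[0][1]);
  }
}
{code}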
I think we just need to validate the input and throw an exception at build
time if this assumption is not met.
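Such a build-time check could be sketched as follows (hypothetical code, not Lucene's actual API): a rule is valid only when its segmentation, with whitespaces removed, is equal to the surface form.

{code:java}
public class UserDictionaryRuleCheck {
  /** Returns true when the segmentation minus whitespaces equals the surface form. */
  static boolean isValidRule(String surface, String segmentation) {
    return segmentation.replaceAll("\\s", "").equals(surface);
  }

  public static void main(String[] args) {
    System.out.println(isValidRule("aaa", "aa a")); // valid: "aa a" -> "aaa"
    System.out.println(isValidRule("aaa", "aaaa")); // broken: one char too long
  }
}
{code}

The dictionary builder would call this for each rule and throw an exception on the first invalid entry, instead of deferring the failure to tokenization time.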
> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing
> synonyms. It looks like the only reason why this might occur is if the offset
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
> ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f -
> nknize - 2018-12-07 14:44:20]
> at
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
> ~[?:?]
> at
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
> ~[lucene-analyzers-common-7.6.0.jar:7.6.0
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
> ~[elasticsearch-6.6.1.jar:6.6.1]
> at
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
> ~[lucene-analyzers-common-7.6.0.jar:7.6.0
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
> ~[lucene-analyzers-common-7.6.0.jar:7.6.0
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
> ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)