[
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892708#comment-16892708
]
Tomoko Uchida commented on LUCENE-8933:
---------------------------------------
Thanks for your explanation and investigation. I agree with this policy.
bq. You don't need emojis or surrogate pairs to break this, just provide a rule
where the length of the segmentation is greater than the input minus the
whitespaces:
Just to confirm, the following entry works without any problem. (Here the same
emoji character appears in both the first and second columns. I think this
should be allowed, because some kanji that are encoded as surrogate pairs are
often used in specific situations such as person names.)
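{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.ja.dict.UserDictionary;

// The emoji (a surrogate pair) appears in both the surface form and the segmentation.
UserDictionary dict = UserDictionary.open(
    new StringReader("アメ🙂カン航空,アメ🙂カン航空,アメリカンコウクウ,カスタム用語"));
JapaneseTokenizer tok = new JapaneseTokenizer(dict, true, Mode.NORMAL);
tok.setReader(new StringReader("アメリカン航空"));
tok.reset();
tok.incrementToken(); // works without any problem
{code}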
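For illustration, here is a rough sketch of the kind of check the quoted remark suggests: compare the first column (surface form) with the second column (segmentation) once whitespace is removed. This is not the code in {{UserDictionary}}; the class name {{UserDictEntryCheck}}, the method {{segmentationMatchesSurface}} and the second and third sample entries are assumptions made up for this example. It uses only the JDK.

```java
public class UserDictEntryCheck {

    /**
     * Returns true if the segmentation (2nd CSV column), with whitespace
     * removed, is exactly the surface form (1st CSV column).
     * Hypothetical helper, not part of Lucene.
     */
    static boolean segmentationMatchesSurface(String csvLine) {
        String[] cols = csvLine.split(",");
        if (cols.length < 2) {
            throw new IllegalArgumentException("Expected at least 2 columns: " + csvLine);
        }
        String surface = cols[0].replaceAll("\\s", "");
        String joinedSegments = cols[1].replaceAll("\\s", "");
        return surface.equals(joinedSegments);
    }

    public static void main(String[] args) {
        // Entry from this comment: the emoji is in both columns, so the columns agree.
        String emojiEntry = "アメ🙂カン航空,アメ🙂カン航空,アメリカンコウクウ,カスタム用語";
        // Made-up entry whose segmentation is split with spaces but still matches.
        String splitEntry = "日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞";
        // Made-up entry whose segmentation is longer than the surface form.
        String tooLongEntry = "日本,日本 経済,ニホン ケイザイ,カスタム名詞";

        System.out.println(segmentationMatchesSurface(emojiEntry));
        System.out.println(segmentationMatchesSurface(splitEntry));
        System.out.println(segmentationMatchesSurface(tooLongEntry));

        // A surrogate pair counts as two chars but one code point.
        String surface = emojiEntry.split(",")[0];
        System.out.println(surface.length() + " chars, "
            + surface.codePointCount(0, surface.length()) + " code points");
    }
}
```

The first two entries pass and the third does not, which matches the point in the quote: no emoji or surrogate pair is needed to build a mismatching rule, and an emoji that appears consistently in both columns is not a mismatch by itself.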
> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing
> synonyms. It looks like the only reason why this might occur is if the offset
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
> at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
> at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
> at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}