[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892644#comment-16892644 ]
Adrien Grand commented on LUCENE-8933:
--------------------------------------

Ah, thanks for digging [~tomoko] and [~danmuzi]. I simplified Tomoko's recreation a bit more:

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
// The segmentation column (second CSV field) contains an emoji.
UserDictionary dict = UserDictionary.open(new StringReader("アメリカン航空,アメ🙂カン航空,アメリカンコウクウ,カスタム用語"));
JapaneseTokenizer tok = new JapaneseTokenizer(dict, true, Mode.NORMAL);
tok.setReader(new StringReader("アメリカン航空"));
tok.reset();
tok.incrementToken();
{code}

Tomoko, I wonder whether the reason the issue doesn't occur when the emoji is at other positions is that the Position class initializes its buffers' sizes to 8?

> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
>                 Key: LUCENE-8933
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8933
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing synonyms. It looks like the only reason why this might occur is if the offset of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
>     at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
>     at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
>     ... 24 more
> {noformat}
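
As background on why an emoji in the user dictionary entry could matter, here is a small sketch that is not part of the issue itself. It only shows a general Java fact: 🙂 (U+1F642) lies outside the Basic Multilingual Plane, so it occupies two {{char}} values, and the segmentation column of the recreation is therefore 8 UTF-16 code units long while the surface form is 7. The class name {{SurrogateLengths}} and its helper methods are made up for illustration, and whether this length mismatch is what actually corrupts the offsets is an assumption, not something the comment above establishes.

```java
final class SurrogateLengths {
    // Length in UTF-16 code units; this is what String.length() and char[] offsets count.
    public static int utf16Length(String s) {
        return s.length();
    }

    // Length in Unicode code points; a surrogate pair counts once here.
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // First CSV column of the recreation (surface form).
        String surface = "アメリカン航空";
        // Second CSV column (segmentation); \uD83D\uDE42 is the surrogate pair for U+1F642.
        String segmentation = "アメ\uD83D\uDE42カン航空";

        System.out.println(utf16Length(surface));          // prints 7
        System.out.println(codePointLength(surface));      // prints 7
        System.out.println(utf16Length(segmentation));     // prints 8
        System.out.println(codePointLength(segmentation)); // prints 7
    }
}
```

The two strings look equally long on screen, but any code that treats the segmentation's {{char}} length as a length within the surface text would be off by one in this example.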