[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893288#comment-16893288 ]
Tomoko Uchida commented on LUCENE-8933: --------------------------------------- Just for clarification, let me wrap up the problem here. - JapaneseTokenizer has "search mode", which break up dictionary tokens (surface forms) to small segments and matches the input text to the segments (for increasing search recall). - The user dictionary of JapaneseTokenizer allows users to specify arbitrary segmentation rules in addition to add custom tokens. - e.g.: If an user entry {{"aabbcc,aa bb cc,aa bb cc,pos_tag"}} is given, the token stream for {{"aabbcc"}} should generate three tokens, {{"aa"}} {{"bb"}} {{"cc"}}. - The sum of length of segments are expected to be exactly same to the length of corresponding surface form (as [~jim.ferenczi] explained). If a segment is longer than its surface form, it's a violation against this assumption and causes an AIOOB when array copying the region of surface form. For purpose of format validation, I think it would be better that we check if the sum of length of segments is equal to the length of its surface form. i.e., we also should not allow such entry {{"aabbcc,a b c,aa bb cc,pos_tag"}} even if this does not cause any exceptions. > JapaneseTokenizer creates Token objects with corrupt offsets > ------------------------------------------------------------ > > Key: LUCENE-8933 > URL: https://issues.apache.org/jira/browse/LUCENE-8933 > Project: Lucene - Core > Issue Type: Bug > Reporter: Adrien Grand > Priority: Minor > > An Elasticsearch user reported the following stack trace when parsing > synonyms. It looks like the only reason why this might occur is if the offset > of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range. > > {noformat} > Caused by: java.lang.ArrayIndexOutOfBoundsException > at > org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) > ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - > nknize - 2018-12-07 14:44:20] > at > org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) > ~[?:?] > at > org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) > ~[lucene-analyzers-common-7.6.0.jar:7.6.0 > 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48] > at > org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) > ~[elasticsearch-6.6.1.jar:6.6.1] > at > org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) > ~[lucene-analyzers-common-7.6.0.jar:7.6.0 > 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48] > at > org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) > ~[lucene-analyzers-common-7.6.0.jar:7.6.0 > 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48] > at > org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) > ~[elasticsearch-6.6.1.jar:6.6.1] > ... 24 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org