[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892562#comment-16892562 ]
Tomoko Uchida commented on LUCENE-8933:
---------------------------------------

This pure Lucene code produces a very similar error and stack trace. (I am using Lucene 7.7.0.)

{code:java}
### userdict.txt
# this entry causes the problem
アメリカン航空,アメ🙂カン航空,アメリカンコウクウ,カスタム用語
# this entry causes the problem
#アメリカン航空,アメ🙂カン 航空,アメリカン コウクウ,カスタム用語
# this entry does not cause the problem
#アメ🙂カン航空,アメリカン航空,アメリカンコウクウ,カスタム用語
# this entry does not cause the problem
#アメリカン航空,アメ🙂ン航空,アメリカンコウクウ,カスタム用語
{code}

{code:java}
### synonyms.txt
アメリカン航空,aa,アメリカン
{code}

{code:java}
import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizerFactory;
import org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory;

public class KuromojiAIOOB {
  public static void main(String[] args) {
    try {
      // Arguments for the JapaneseTokenizer, including the user dictionary
      Map<String, String> args1 = new HashMap<>();
      args1.put("mode", "normal");
      args1.put("userDictionary", "userdict.txt");
      args1.put("userDictionaryEncoding", "UTF-8");
      args1.put("discardPunctuation", "false");

      // Arguments for the synonym filter; the tokenizer settings are repeated so that
      // the tokenizer used to parse synonyms.txt also loads the user dictionary
      Map<String, String> args2 = new HashMap<>();
      args2.put("synonyms", "synonyms.txt");
      args2.put("format", "solr");
      args2.put("tokenizerFactory", "org.apache.lucene.analysis.ja.JapaneseTokenizerFactory");
      args2.put("mode", "normal");
      args2.put("userDictionary", "userdict.txt");
      args2.put("userDictionaryEncoding", "UTF-8");
      args2.put("discardPunctuation", "false");

      // Resources are resolved relative to lucene-8933/conf;
      // the exception is thrown from addTokenFilter(), while the synonyms are loaded
      CustomAnalyzer analyzer = CustomAnalyzer.builder(Paths.get("lucene-8933", "conf"))
          .withTokenizer(JapaneseTokenizerFactory.class, args1)
          .addTokenFilter(SynonymGraphFilterFactory.class, args2)
          .build();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
{code}

This error occurs when building the CustomAnalyzer:

{code:java}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: arraycopy: last source index 8 out of bounds for char[7]
    at java.base/java.lang.System.arraycopy(Native Method)
    at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
    at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
    at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
    at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
    at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
    at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:174)
    at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:149)
    at org.apache.lucene.analysis.custom.CustomAnalyzer$Builder.applyResourceLoader(CustomAnalyzer.java:559)
    at org.apache.lucene.analysis.custom.CustomAnalyzer$Builder.addTokenFilter(CustomAnalyzer.java:336)
    at KuromojiAIOOB.main(KuromojiAIOOB.java:31)
{code}

> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
>                 Key: LUCENE-8933
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8933
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing
> synonyms. It looks like the only reason why this might occur is if the offset
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
>     at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
>     at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
>     ... 24 more
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
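
A note on the four userdict.txt entries in the comment: one pattern that appears consistent with them, and with the "last source index 8 out of bounds for char[7]" message, is that the two failing entries have a segmentation column that is longer (in UTF-16 chars) than the surface form, because the emoji occupies two chars. This is an observation drawn from the entries, not a confirmed root cause. The sketch below uses only the JDK to compare those lengths; the class name UserDictLengthCheck and both helper methods are hypothetical and are not part of Lucene.

```java
public class UserDictLengthCheck {

    // Hypothetical helper: UTF-16 length of the segmentation column,
    // ignoring the spaces that separate the segments.
    static int segmentationLength(String segmentation) {
        return segmentation.replace(" ", "").length();
    }

    // Hypothetical helper: true when the segments together are longer
    // than the surface form they are supposed to split.
    static boolean segmentationOverflows(String surface, String segmentation) {
        return segmentationLength(segmentation) > surface.length();
    }

    public static void main(String[] args) {
        // {surface, segmentation} pairs copied from userdict.txt in the comment
        String[][] entries = {
            {"アメリカン航空", "アメ🙂カン航空"},   // reported as causing the problem
            {"アメリカン航空", "アメ🙂カン 航空"},  // reported as causing the problem
            {"アメ🙂カン航空", "アメリカン航空"},   // reported as not causing the problem
            {"アメリカン航空", "アメ🙂ン航空"},     // reported as not causing the problem
        };
        for (String[] entry : entries) {
            System.out.println(entry[0] + " -> " + entry[1]
                + ": surface=" + entry[0].length()
                + ", segmentation=" + segmentationLength(entry[1])
                + ", overflows=" + segmentationOverflows(entry[0], entry[1]));
        }
    }
}
```

String.length() counts UTF-16 code units rather than code points, so the emoji contributes 2 to each length. Under that counting, only the first two entries have a segmentation longer than the 7-char surface form.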