[ https://issues.apache.org/jira/browse/LUCENE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir resolved LUCENE-7916.
---------------------------------
    Resolution: Fixed

Thanks [~ckoenig42] !

> CompositeBreakIterator is brittle under ICU4J upgrade.
> ------------------------------------------------------
>
>                 Key: LUCENE-7916
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7916
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>            Reporter: Chris Koenig
>             Fix For: master (8.0), 7.1
>
>         Attachments: LUCENE-7916.patch, LUCENE-7916.patch
>
>
> We use lucene-analyzers-icu version 6.6.0 in our project. Lucene 6.6.0 is
> built against ICU4J version 56.1, but our use case requires us to use the
> latest version of ICU4J, 59.1.
> The problem that we have encountered is that
> CompositeBreakIterator.getBreakIterator(int scriptCode) throws an
> ArrayIndexOutOfBoundsException for script codes higher than 166. In ICU4J
> 56.1 the highest possible script code is 166, but in ICU4J 59.1 it is 174.
> Internally, CompositeBreakIterator creates an array of size
> UScript.CODE_LIMIT, but the value of CODE_LIMIT from ICU4J 56.1 is baked
> into the bytecode by the compiler. So even after overriding the version of
> the ICU4J dependency to 59.1 in our project, this array still has size 167,
> which is too small.
> {code}
> final class CompositeBreakIterator {
>   private final ICUTokenizerConfig config;
>   private final BreakIteratorWrapper wordBreakers[] = new BreakIteratorWrapper[UScript.CODE_LIMIT];
> {code}
> Output of javap run on CompositeBreakIterator.class from
> lucene-analyzers-icu-6.6.0.jar:
> {code}
> Compiled from "CompositeBreakIterator.java"
> final class org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator {
>   org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator(org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig);
>     descriptor: (Lorg/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig;)V
>     Code:
>        0: aload_0
>        1: invokespecial #1   // Method java/lang/Object."<init>":()V
>        4: aload_0
>        5: sipush        167
>        8: anewarray     #3   // class org/apache/lucene/analysis/icu/segmentation/BreakIteratorWrapper
> {code}
> In our case, the ArrayIndexOutOfBoundsException was triggered when we
> encountered a stray character of the Bhaiksuki script (script code 168) in a
> chunk of text that we processed.
> CompositeBreakIterator can be made more resilient by changing the type of
> wordBreakers from an array to a Map and no longer relying on the value of
> UScript.CODE_LIMIT.
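
The array-to-Map change proposed in the description can be sketched roughly as follows. This is only an illustration of the idea, not the attached patch: the class name ScriptKeyedCache, its factory parameter, and the size() helper are hypothetical, and a generic type stands in for BreakIteratorWrapper so the example compiles without Lucene or ICU4J on the classpath. The point is that a Map keyed by script code has no fixed capacity, so no value of UScript.CODE_LIMIT gets inlined at compile time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/**
 * Hypothetical stand-in for the per-script cache in CompositeBreakIterator.
 * Entries are created lazily and keyed by script code, so any code the
 * runtime ICU4J version reports is accepted, whatever its CODE_LIMIT is.
 */
final class ScriptKeyedCache<T> {
  // Replaces "new BreakIteratorWrapper[UScript.CODE_LIMIT]" in this sketch.
  private final Map<Integer, T> byScript = new HashMap<>();

  // Assumed factory: builds the per-script object on first use.
  private final IntFunction<T> factory;

  ScriptKeyedCache(IntFunction<T> factory) {
    this.factory = factory;
  }

  /** Returns the cached object for scriptCode, creating it on first request. */
  T get(int scriptCode) {
    return byScript.computeIfAbsent(scriptCode, code -> factory.apply(code));
  }

  /** Number of script codes seen so far (illustration only). */
  int size() {
    return byScript.size();
  }
}
```

With this shape, a lookup for script code 168 (Bhaiksuki) creates and caches an entry instead of indexing past the end of a 167-element array. The trade-off is boxing of the int key on each lookup, which may or may not matter in a tokenizer hot path.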