[ 
https://issues.apache.org/jira/browse/LUCENE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-7916.
---------------------------------
    Resolution: Fixed

Thanks [~ckoenig42] !

> CompositeBreakIterator is brittle under ICU4J upgrade.
> ------------------------------------------------------
>
>                 Key: LUCENE-7916
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7916
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>            Reporter: Chris Koenig
>             Fix For: master (8.0), 7.1
>
>         Attachments: LUCENE-7916.patch, LUCENE-7916.patch
>
>
> We use lucene-analyzers-icu version 6.6.0 in our project. Lucene 6.6.0 is 
> built against ICU4J version 56.1, but our use case requires us to use the 
> latest version of ICU4J, 59.1.
> The problem that we have encountered is that 
> CompositeBreakIterator.getBreakIterator(int scriptCode) throws an 
> ArrayIndexOutOfBoundsException for script codes higher than 167. In ICU4J 
> 56.1 the highest possible script code is 166, but in ICU4j 59.1 it is 174.
> Internally, CompositeBreakIterator is creating an array of size 
> UScript.CODE_LIMIT, but the value of CODE_LIMIT from ICU4J 56.1 is being 
> baked into the bytecode by the compiler. So even after overriding the version 
> of the ICU4J dependency to 59.1 in our project, this array will still be size 
> 167, which is too small.
> {code}
> final class CompositeBreakIterator {
>   private final ICUTokenizerConfig config;
>   private final BreakIteratorWrapper wordBreakers[] = new 
> BreakIteratorWrapper[UScript.CODE_LIMIT];
> {code}
> Output of javap run on CompositeBreakIterator.class from 
> lucene-analyzers-icu-6.6.0.jar
> {code}
> Compiled from "CompositeBreakIterator.java"
> final class 
> org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator {
>   
> org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator(org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig);
>     descriptor: 
> (Lorg/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig;)V
>     Code:
>        0: aload_0
>        1: invokespecial #1                  // Method 
> java/lang/Object."<init>":()V
>        4: aload_0
>        5: sipush        167
>        8: anewarray     #3                  // class 
> org/apache/lucene/analysis/icu/segmentation/BreakIteratorWrapper
> {code}
> In our case, the ArrayIndexOutOfBoundsException was triggered when we 
> encountered a stray character of the Bhaiksuki script (script code 168) in a 
> chunk of text that we processed.
> CompositeBreakIterator can be made more resilient by changing the type of 
> wordBreakers from an array to a Map and no longer relying on the value of 
> UScript.CODE_LIMIT.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to