[jira] [Commented] (LUCENE-7916) CompositeBreakIterator is brittle under ICU4J upgrade.

Chris Koenig (JIRA) Wed, 02 Aug 2017 11:06:36 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111438#comment-16111438
 ]


Chris Koenig commented on LUCENE-7916:
--------------------------------------

Thanks for your feedback. I didn't realize how tightly coupled Lucene is to a 
particular ICU release.

In our case, we are using ICUTokenizer, but we have modified the default
ruleset of RuleBasedBreakIterator to break on emoji characters so that we can
search for emoji in text. The Unicode properties for emoji that our rules
depend on were added to UProperty starting with ICU 57. Because we compile our
own RBBI rules, we are not exposed to any breakage that a change in the binary
rule encoding might cause when ICU is upgraded. We do not use the Normalizer or
Folding filters, so we are not exposed there either. After thorough A/B
testing, this is working well for us in production, with the exception of the
issue reported above, which has occurred only once so far.

The underlying issue for us is that Lucene 6.6.0 is pegged to a fairly old 
version of ICU. In hindsight it might have been safer for us to fork 
lucene-analyzers-icu temporarily to build our own internal release against ICU 
59.1.

From what I've seen in JIRA and the git repo, it looks like 6.7 is targeted at
ICU 59.1. Is there an ETA for the release of 6.7?

> CompositeBreakIterator is brittle under ICU4J upgrade.
> ------------------------------------------------------
>
>                 Key: LUCENE-7916
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7916
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>            Reporter: Chris Koenig
>         Attachments: LUCENE-7916.patch
>
>
> We use lucene-analyzers-icu version 6.6.0 in our project. Lucene 6.6.0 is 
> built against ICU4J version 56.1, but our use case requires us to use the 
> latest version of ICU4J, 59.1.
> The problem that we have encountered is that 
> CompositeBreakIterator.getBreakIterator(int scriptCode) throws an 
> ArrayIndexOutOfBoundsException for script codes of 167 or higher. In ICU4J 
> 56.1 the highest possible script code is 166, but in ICU4J 59.1 it is 174.
> Internally, CompositeBreakIterator is creating an array of size 
> UScript.CODE_LIMIT, but the value of CODE_LIMIT from ICU4J 56.1 is being 
> baked into the bytecode by the compiler. So even after overriding the version 
> of the ICU4J dependency to 59.1 in our project, this array will still be size 
> 167, which is too small.
> {code}
> final class CompositeBreakIterator {
>   private final ICUTokenizerConfig config;
>   private final BreakIteratorWrapper wordBreakers[] = new BreakIteratorWrapper[UScript.CODE_LIMIT];
> {code}
> Output of javap run on CompositeBreakIterator.class from 
> lucene-analyzers-icu-6.6.0.jar
> {code}
> Compiled from "CompositeBreakIterator.java"
> final class org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator {
>   org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator(org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig);
>     descriptor: (Lorg/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig;)V
>     Code:
>        0: aload_0
>        1: invokespecial #1                  // Method java/lang/Object."<init>":()V
>        4: aload_0
>        5: sipush        167
>        8: anewarray     #3                  // class org/apache/lucene/analysis/icu/segmentation/BreakIteratorWrapper
> {code}
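
The inlining visible in the javap output can be reproduced without ICU4J. The
following is only a minimal sketch: the class names ConstantInliningDemo,
InlinedLimit and RuntimeLimit are hypothetical stand-ins, not Lucene or ICU4J
code. It relies on two rules of the Java language: a static final int
initialized with a constant expression is copied into every class that uses
it, and reading such a constant does not even initialize the class that
declares it.

```java
public class ConstantInliningDemo {
    // Records which holder classes ran their static initializer.
    public static final StringBuilder INIT_LOG = new StringBuilder();

    // Stand-in for a library constant such as UScript.CODE_LIMIT:
    // a compile-time constant, so javac copies 167 into each caller.
    static class InlinedLimit {
        static final int CODE_LIMIT = 167;
        static { INIT_LOG.append("InlinedLimit "); }
    }

    // Same value, but the initializer is a method call rather than a
    // constant expression, so callers must read the field at run time.
    static class RuntimeLimit {
        static final int CODE_LIMIT = Integer.valueOf(167);
        static { INIT_LOG.append("RuntimeLimit "); }
    }

    // Compiles to a literal 167; InlinedLimit is never touched at run time.
    public static int inlinedArrayLength() {
        return new Object[InlinedLimit.CODE_LIMIT].length;
    }

    // Compiles to a getstatic on RuntimeLimit, which triggers its initializer.
    public static int runtimeArrayLength() {
        return new Object[RuntimeLimit.CODE_LIMIT].length;
    }

    public static void main(String[] args) {
        System.out.println("inlined length: " + inlinedArrayLength());
        System.out.println("initialized so far: [" + INIT_LOG + "]");
        System.out.println("runtime length: " + runtimeArrayLength());
        System.out.println("initialized so far: [" + INIT_LOG + "]");
    }
}
```

Because the inlined value lives in the caller's bytecode, swapping in a newer
jar for the declaring library does not change it until the caller itself is
recompiled.
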
> In our case, the ArrayIndexOutOfBoundsException was triggered when we 
> encountered a stray character of the Bhaiksuki script (script code 168) in a 
> chunk of text that we processed.
> CompositeBreakIterator can be made more resilient by changing the type of 
> wordBreakers from an array to a Map and no longer relying on the value of 
> UScript.CODE_LIMIT.
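
A rough sketch of the Map-based idea follows. It is not the attached patch and
not Lucene code: ScriptBreakerCache, its type parameter B (standing in for
BreakIteratorWrapper) and the factory argument are hypothetical names used
only for illustration. The point is that a map keyed by script code has no
fixed upper bound, so a script code introduced by a newer ICU4J cannot index
past the end of anything.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class ScriptBreakerCache<B> {
    // Keyed by script code; grows on demand instead of being sized up front.
    private final Map<Integer, B> breakers = new HashMap<>();
    // Creates the breaker for a script code the first time it is requested.
    private final IntFunction<B> factory;

    public ScriptBreakerCache(IntFunction<B> factory) {
        this.factory = factory;
    }

    // Returns the cached breaker for scriptCode, creating it if absent.
    public B get(int scriptCode) {
        B breaker = breakers.get(scriptCode);
        if (breaker == null) {
            breaker = factory.apply(scriptCode);
            breakers.put(scriptCode, breaker);
        }
        return breaker;
    }

    // Number of script codes that have a cached breaker.
    public int size() {
        return breakers.size();
    }

    public static void main(String[] args) {
        // Strings stand in for real break iterators in this sketch.
        ScriptBreakerCache<String> cache =
            new ScriptBreakerCache<>(code -> "breaker-for-script-" + code);
        // 168 is the Bhaiksuki script code mentioned above.
        System.out.println(cache.get(168));
        System.out.println("cached entries: " + cache.size());
    }
}
```

The trade-off is a boxed key and a hash lookup per call in place of a plain
array index; whether that matters for tokenization throughput would need to be
measured.
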



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

