[jira] [Commented] (LUCENE-7916) CompositeBreakIterator is brittle under ICU4J upgrade.

Robert Muir (JIRA) Wed, 02 Aug 2017 17:36:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111977#comment-16111977
 ]


Robert Muir commented on LUCENE-7916:
-------------------------------------

{quote}
In our case, we are using ICUTokenizer but we have modified the default ruleset 
of RuleBasedBreakIterator to break on emoji characters so that we can search 
for emoji in text.
{quote}

Cool!

{quote}
The underlying issue for us is that Lucene 6.6.0 is pegged to a fairly old 
version of ICU. In hindsight it might have been safer for us to fork 
lucene-analyzers-icu temporarily to build our own internal release against ICU 
59.1.
{quote}

Yeah, when we upgrade ICU versions we run a script that regenerates 
normalization and segmentation data files for that specific ICU jar / Unicode 
version: {{ant regenerate}} from lucene/analyzers/icu. So at the minimum this 
should really be done (followed of course by {{ant test}}) so that things work 
correctly.

{quote}
From what I've seen in JIRA and the git repo, it looks like 6.7 is targeted at 
ICU 59.1. Is there an ETA for the release of 6.7?
{quote}

I'm not sure, maybe ask the dev list about this? But it seems most work is 
towards 7.0 and onwards. 

The real problem was falling so far behind on ICU versions. You can see why if 
you look at LUCENE-7540: a bug 
(http://bugs.icu-project.org/trac/ticket/12873) was introduced into ICU that 
our test suite detected, although we didn't know the cause at the time. It was 
fixed in ICU 59.1, so we were then able to upgrade.

> CompositeBreakIterator is brittle under ICU4J upgrade.
> ------------------------------------------------------
>
>                 Key: LUCENE-7916
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7916
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>            Reporter: Chris Koenig
>         Attachments: LUCENE-7916.patch, LUCENE-7916.patch
>
>
> We use lucene-analyzers-icu version 6.6.0 in our project. Lucene 6.6.0 is 
> built against ICU4J version 56.1, but our use case requires us to use the 
> latest version of ICU4J, 59.1.
> The problem that we have encountered is that 
> CompositeBreakIterator.getBreakIterator(int scriptCode) throws an 
> ArrayIndexOutOfBoundsException for script codes of 167 or higher. In ICU4J 
> 56.1 the highest possible script code is 166, but in ICU4J 59.1 it is 174.
> Internally, CompositeBreakIterator is creating an array of size 
> UScript.CODE_LIMIT, but the value of CODE_LIMIT from ICU4J 56.1 is being 
> baked into the bytecode by the compiler. So even after overriding the version 
> of the ICU4J dependency to 59.1 in our project, this array will still be size 
> 167, which is too small.
> {code}
> final class CompositeBreakIterator {
>   private final ICUTokenizerConfig config;
>   private final BreakIteratorWrapper wordBreakers[] = new BreakIteratorWrapper[UScript.CODE_LIMIT];
> {code}
> Output of javap run on CompositeBreakIterator.class from 
> lucene-analyzers-icu-6.6.0.jar
> {code}
> Compiled from "CompositeBreakIterator.java"
> final class org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator {
>   org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator(org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig);
>     descriptor: (Lorg/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig;)V
>     Code:
>        0: aload_0
>        1: invokespecial #1                  // Method java/lang/Object."<init>":()V
>        4: aload_0
>        5: sipush        167
>        8: anewarray     #3                  // class org/apache/lucene/analysis/icu/segmentation/BreakIteratorWrapper
> {code}
> In our case, the ArrayIndexOutOfBoundsException was triggered when we 
> encountered a stray character of the Bhaiksuki script (script code 168) in a 
> chunk of text that we processed.
> CompositeBreakIterator can be made more resilient by changing the type of 
> wordBreakers from an array to a Map and no longer relying on the value of 
> UScript.CODE_LIMIT.
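
The Map-based approach suggested in the description could look roughly like the following. This is only a minimal sketch and not the attached patch: ScriptKeyedCache and its factory parameter are hypothetical names, and a generic value type stands in for BreakIteratorWrapper so that the example compiles without Lucene or ICU4J on the classpath. The point is that nothing is sized from UScript.CODE_LIMIT, so no constant from an older ICU4J can be inlined into the bytecode.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/**
 * Hypothetical stand-in for the per-script cache in CompositeBreakIterator.
 * Values are created lazily and keyed by script code, so the cache accepts
 * any script code the ICU4J version present at runtime can report.
 */
final class ScriptKeyedCache<T> {
  // A Map instead of an array: no length is fixed at compile time.
  private final Map<Integer, T> byScript = new HashMap<>();
  // Assumed factory that builds the value for a given script code.
  private final IntFunction<T> factory;

  ScriptKeyedCache(IntFunction<T> factory) {
    this.factory = factory;
  }

  /** Returns the cached value for scriptCode, creating it on first use. */
  T get(int scriptCode) {
    T value = byScript.get(scriptCode);
    if (value == null) {
      value = factory.apply(scriptCode);
      byScript.put(scriptCode, value);
    }
    return value;
  }

  /** Number of script codes seen so far. */
  int size() {
    return byScript.size();
  }
}
```

With something like this, a lookup for script code 168 (Bhaiksuki) simply creates and caches a new entry instead of indexing past the end of a 167-element array. The trade-off is a boxed Integer key and a hash lookup per call, where the array version needs only an index.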



