[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892562#comment-16892562 ]

Tomoko Uchida commented on LUCENE-8933:
---------------------------------------

The following pure Lucene code reproduces a very similar error and stack trace. (I used Lucene 7.7.0.)
{code}
### userdict.txt
# this entry causes the problem
アメリカン航空,アメ🙂カン航空,アメリカンコウクウ,カスタム用語

# this entry causes the problem
#アメリカン航空,アメ🙂カン 航空,アメリカン コウクウ,カスタム用語

# this entry does not cause the problem
#アメ🙂カン航空,アメリカン航空,アメリカンコウクウ,カスタム用語

# this entry does not cause the problem
#アメリカン航空,アメ🙂ン航空,アメリカンコウクウ,カスタム用語
{code}
{code}
### synonyms.txt
アメリカン航空,aa,アメリカン
{code}
{code:java}
import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizerFactory;
import org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory;

public class KuromojiAIOOB {

    public static void main(String[] args) {
        try {
            Map<String, String> args1 = new HashMap<>();
            args1.put("mode", "normal");
            args1.put("userDictionary", "userdict.txt");
            args1.put("userDictionaryEncoding", "UTF-8");
            args1.put("discardPunctuation", "false");

            Map<String, String> args2 = new HashMap<>();
            args2.put("synonyms", "synonyms.txt");
            args2.put("format", "solr");
            args2.put("tokenizerFactory", "org.apache.lucene.analysis.ja.JapaneseTokenizerFactory");
            args2.put("mode", "normal");
            args2.put("userDictionary", "userdict.txt");
            args2.put("userDictionaryEncoding", "UTF-8");
            args2.put("discardPunctuation", "false");

            CustomAnalyzer analyzer = CustomAnalyzer.builder(Paths.get("lucene-8933", "conf"))
                    .withTokenizer(JapaneseTokenizerFactory.class, args1)
                    .addTokenFilter(SynonymGraphFilterFactory.class, args2)
                    .build();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
{code}
This error occurs when building the CustomAnalyzer:
{code:java}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: arraycopy: last source index 8 out of bounds for char[7]
        at java.base/java.lang.System.arraycopy(Native Method)
        at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
        at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
        at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
        at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
        at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
        at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:174)
        at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:149)
        at org.apache.lucene.analysis.custom.CustomAnalyzer$Builder.applyResourceLoader(CustomAnalyzer.java:559)
        at org.apache.lucene.analysis.custom.CustomAnalyzer$Builder.addTokenFilter(CustomAnalyzer.java:336)
        at KuromojiAIOOB.main(KuromojiAIOOB.java:31)
{code}
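
A guess at the mechanism (my assumption, not verified in the tokenizer source): 🙂 (U+1F642) is a supplementary character, so Java stores it as a surrogate pair, and the surface form's char length (8) differs from its code point count (7). That matches the "last source index 8 out of bounds for char[7]" message, suggesting offsets are being computed in code points somewhere but used as char indices. A minimal illustration:
{code:java}
public class SurrogateLength {
    public static void main(String[] args) {
        // Surface form from userdict.txt; 🙂 is outside the BMP,
        // so it occupies two chars (a surrogate pair) in a Java String.
        String entry = "アメ🙂カン航空";
        System.out.println(entry.length());                          // 8 UTF-16 chars
        System.out.println(entry.codePointCount(0, entry.length())); // 7 code points
    }
}
{code}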

> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
>                 Key: LUCENE-8933
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8933
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
>     at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
>     at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
>     ... 24 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
