[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

Tomoko Uchida (JIRA) Thu, 25 Jul 2019 20:27:14 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893288#comment-16893288
 ]


Tomoko Uchida commented on LUCENE-8933:
---------------------------------------

To clarify, let me summarize the problem here.
 - JapaneseTokenizer has a "search mode", which breaks up dictionary tokens 
(surface forms) into smaller segments and matches the input text against those 
segments (to increase search recall).
 - The user dictionary of JapaneseTokenizer lets users specify arbitrary 
segmentation rules in addition to adding custom tokens.
 - e.g.: if a user entry {{"aabbcc,aa bb cc,aa bb cc,pos_tag"}} is given, the 
token stream for {{"aabbcc"}} should generate three tokens: {{"aa"}}, {{"bb"}}, 
{{"cc"}}.
 - The sum of the segment lengths is expected to be exactly the same as the 
length of the corresponding surface form (as [~jim.ferenczi] explained). If the 
segments are longer in total than the surface form, this assumption is violated 
and an AIOOB (ArrayIndexOutOfBoundsException) is thrown when the region of the 
surface form is copied.
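
To illustrate the last point, here is a simplified model of copying segments 
out of a surface-form buffer by length alone. It is only a sketch, not the 
actual JapaneseTokenizer code: the class {{SegmentCopySketch}} and the method 
{{sliceBySegmentLengths}} are hypothetical names made up for this example.

```java
// Simplified model (NOT the real JapaneseTokenizer code): each token is copied
// out of the surface-form buffer using only the configured segment lengths.
final class SegmentCopySketch {

    // Hypothetical helper: returns one string per segment length, read
    // consecutively from the surface form.
    static String[] sliceBySegmentLengths(String surface, int[] segmentLengths) {
        char[] buffer = surface.toCharArray();
        String[] tokens = new String[segmentLengths.length];
        int offset = 0;
        for (int i = 0; i < segmentLengths.length; i++) {
            char[] term = new char[segmentLengths[i]];
            // Throws an IndexOutOfBoundsException once offset + length
            // runs past the end of the surface-form buffer.
            System.arraycopy(buffer, offset, term, 0, segmentLengths[i]);
            tokens[i] = new String(term);
            offset += segmentLengths[i];
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Lengths sum to the surface length: the expected three tokens.
        System.out.println(String.join(" ", sliceBySegmentLengths("aabbcc", new int[] {2, 2, 2})));
        // Lengths sum to less: no exception, but the tokens are not the intended ones.
        System.out.println(String.join(" ", sliceBySegmentLengths("aabbcc", new int[] {1, 1, 1})));
        // Lengths sum to more: the last copy reads past the end of the buffer.
        try {
            sliceBySegmentLengths("aabbcc", new int[] {2, 2, 3});
        } catch (IndexOutOfBoundsException e) {
            System.out.println("out of bounds: " + e);
        }
    }
}
```

In this model, segment lengths that add up to less than the surface length do 
not throw, but they silently produce tokens that do not cover the surface form.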

For the purpose of format validation, I think it would be better to check that 
the sum of the segment lengths is equal to the length of the surface form,
 i.e., we should also reject an entry such as 
{{"aabbcc,a b c,aa bb cc,pos_tag"}} even though it does not cause any exception.
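
A minimal sketch of such a check, assuming the entry is a plain 
comma-separated line whose first field is the surface form and whose second 
field is the space-separated segmentation. The class {{UserDictEntryCheck}} and 
the method {{isConsistent}} are hypothetical names for illustration; the naive 
{{split(",")}} parsing and the use of {{String.length()}} (UTF-16 code units) 
are also assumptions, and this is not the code in the user dictionary reader.

```java
// Sketch only: validates one user dictionary line of the assumed form
// "surface,segmentation,readings,pos_tag".
final class UserDictEntryCheck {

    // Hypothetical helper: true if the segment lengths add up to the
    // length of the surface form.
    static boolean isConsistent(String entryLine) {
        String[] fields = entryLine.split(",", -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("Malformed entry: " + entryLine);
        }
        String surface = fields[0];
        String[] segments = fields[1].trim().split("\\s+");
        int totalSegmentLength = 0;
        for (String segment : segments) {
            totalSegmentLength += segment.length();
        }
        return totalSegmentLength == surface.length();
    }

    public static void main(String[] args) {
        // Segments cover the surface form exactly.
        System.out.println(isConsistent("aabbcc,aa bb cc,aa bb cc,pos_tag"));
        // Segments are too short in total: no exception at tokenization time,
        // but the entry should still be rejected.
        System.out.println(isConsistent("aabbcc,a b c,aa bb cc,pos_tag"));
        // Segments are too long in total: the case that leads to the AIOOB.
        System.out.println(isConsistent("aabbcc,aa bb ccc,aa bb cc,pos_tag"));
    }
}
```

A loader could call a check like this for every line and fail fast with a 
message naming the offending entry, instead of failing later at tokenization.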

> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
>                 Key: LUCENE-8933
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8933
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
>     at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
>     at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
>     at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
>     ... 24 more
> {noformat}



