[ https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533378#comment-17533378 ]
Tomoko Uchida commented on LUCENE-10059:
----------------------------------------

bq. I am also working on a separate PR to apply the fix to the Korean tokenizer.

The same check was applied to Nori along with the recent refactoring in both tokenizers: https://github.com/apache/lucene/pull/805.
The whole change can't be applied to branch_9x, so I'll backport the boundary check for Nori to the 9x branch soon; we'll have it in 9.2.

> Assertion error in JapaneseTokenizer backtrace
> ----------------------------------------------
>
>                 Key: LUCENE-10059
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10059
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 8.8
>            Reporter: Anh Dung Bui
>            Priority: Major
>             Fix For: 8.x, 9.0
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There is a rare case which causes an AssertionError in the backtrace step of
> JapaneseTokenizer that we (Amazon Product Search) found in our tests.
> If there is a text span of length 1024 (determined by
> [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116])
> where the regular backtrace is not called, a [forced
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781]
> will be applied. If the partially best path at this point happens to end at
> the last pos, and since there is always a [final
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044]
> applied at the end, the final backtrace will try to backtrace from and to
> the same position, causing an AssertionError in RollingCharBuffer.get() when
> it tries to generate an empty buffer.
> We are fixing it by returning early in the backtrace() method when the
> from and to pos are the same:
> {code:java}
> if (endPos == lastBackTracePos) {
>   return;
> }
> {code}
> The backtrace() method is essentially a no-op when this condition happens, so
> when _-ea_ is not enabled, it can still output the correct tokens.
> We will open a PR for this issue.
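
The early return described above can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not the Lucene code: the class BacktraceGuardSketch, its guardEnabled flag, the emitted() accessor, and the String-backed get() helper are hypothetical stand-ins. Only the names endPos, lastBackTracePos, and backtrace() come from the issue text. The helper throws AssertionError explicitly so the behavior does not depend on whether -ea is set.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the empty-span backtrace; NOT the real JapaneseTokenizer. */
public class BacktraceGuardSketch {
  private final String text;           // hypothetical stand-in for the rolling buffer
  private final boolean guardEnabled;  // hypothetical switch to compare both behaviors
  private final List<String> emitted = new ArrayList<>();
  private int lastBackTracePos = 0;

  public BacktraceGuardSketch(String text, boolean guardEnabled) {
    this.text = text;
    this.guardEnabled = guardEnabled;
  }

  /** Stand-in for a buffer read that rejects an empty request. */
  private String get(int start, int length) {
    if (length <= 0) {
      throw new AssertionError("length must be > 0, got " + length);
    }
    return text.substring(start, start + length);
  }

  /** Emits the span between the previous backtrace position and endPos. */
  public void backtrace(int endPos) {
    // The guard from the issue: nothing to do when from and to are the same.
    if (guardEnabled && endPos == lastBackTracePos) {
      return;
    }
    emitted.add(get(lastBackTracePos, endPos - lastBackTracePos));
    lastBackTracePos = endPos;
  }

  public List<String> emitted() {
    return emitted;
  }

  public static void main(String[] args) {
    BacktraceGuardSketch guarded = new BacktraceGuardSketch("abcdef", true);
    guarded.backtrace(6); // models a forced backtrace that ends at the last pos
    guarded.backtrace(6); // models the final backtrace: same pos, returns early
    System.out.println(guarded.emitted());

    BacktraceGuardSketch unguarded = new BacktraceGuardSketch("abcdef", false);
    unguarded.backtrace(6);
    try {
      unguarded.backtrace(6); // empty span reaches get() and fails
    } catch (AssertionError e) {
      System.out.println("without guard: " + e.getMessage());
    }
  }
}
```

In this model the guarded instance emits the span once and treats the second call as a no-op, while the unguarded one fails on the zero-length read, which mirrors the from-and-to-same-position case in the description.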