chenhh021 opened a new issue, #775:
URL: https://github.com/apache/lucenenet/issues/775

   I found this bug while searching Lucene for a solution to the last issue 
I reported. It is reproducible in Lucene.NET even though the original report is 
for Lucene 8.x. The description below is taken from 
[LUCENE-10059](https://issues.apache.org/jira/browse/LUCENE-10059).
   
   In a rare case an AssertionException will be thrown in the backtrace step of 
JapaneseTokenizer.
   
   If there is a text span of length 1024 (determined by MAX_BACKTRACE_GAP) 
where the regular backtrace is not called, a forced backtrace will be applied. 
If the partially best path at this point happens to end at the last pos, and 
since there is always a final backtrace applied at the end, the final backtrace 
will try to backtrace from and to the same position, causing an AssertionError 
in RollingCharBuffer.get() when it tries to generate an empty buffer.
   
   It can be reproduced by adding some code in 
[TestJapaneseTokenizer](https://github.com/apache/lucenenet/blob/11806edbdaa4686b73806066165f27cbbd9aef3b/src/Lucene.Net.Tests.Analysis.Kuromoji/TestJapaneseTokenizer.cs):
   
   ```c# 
   [Test]
   public void TestEmptyBacktrace()
   {
       // The max backtrace gap (JapaneseTokenizer.MAX_BACKTRACE_GAP) is 1024, so
       // the first 1023 characters must generate multiple paths to keep the
       // regular backtrace from running.
       // The last 2 characters form a valid word, so they end up together.
       string text = new string('あ', 1023) + "手紙";
   
       var outputs = new List<string>();
       for (int i = 0; i < 511; i++)
       {
           outputs.Add("ああ");
       }
       outputs.Add("あ");
       outputs.Add("手紙");
   
       AssertAnalyzesTo(analyzer, text, outputs.ToArray());
   }
   ```
   
   This can be fixed by stopping the backtrace when the from and to positions 
are the same. I will create a PR that ports the [Lucene 
patch](https://github.com/apache/lucene/pull/254/files#diff-519e00792a2747b10ceb9bb643057485e79135502b5869ea6f7ea284e7dafce6).
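
To illustrate the idea only, here is a minimal, self-contained sketch of such a guard. It is not the Lucene patch and does not use the real tokenizer: `ToyBuffer`, `BacktraceSketch`, and the parameter names are hypothetical stand-ins, and the only behavior assumed from the description above is that the buffer fails when asked for an empty span.

```csharp
using System;

// Hypothetical stand-in for the tokenizer's rolling buffer. Like the buffer
// described in this issue, it refuses a request for zero characters.
public sealed class ToyBuffer
{
    private readonly string text;

    public ToyBuffer(string text) => this.text = text;

    public char[] Get(int start, int length)
    {
        if (length <= 0)
        {
            throw new InvalidOperationException("length must be > 0");
        }
        return text.Substring(start, length).ToCharArray();
    }
}

public static class BacktraceSketch
{
    // Returns the characters in [lastBackTracePos, endPos).
    // When both positions are equal there is nothing to backtrace, so the
    // method returns early instead of asking the buffer for an empty span.
    public static char[] Backtrace(ToyBuffer buffer, int lastBackTracePos, int endPos)
    {
        if (endPos == lastBackTracePos)
        {
            return Array.Empty<char>();
        }
        return buffer.Get(lastBackTracePos, endPos - lastBackTracePos);
    }
}
```

Without the early return, calling `Backtrace(buffer, 3, 3)` would reach `Get(3, 0)` and throw, which mirrors the failure mode described above; with it, the call simply produces no output.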
   
   The PR may break parity with Lucene 4.8 and may not be accepted, but I 
decided to create it in case someone else runs into the same problem.
   
   BTW, I found that several other Lucene bugs also exist in Lucene.NET. I've 
done most of the work of porting the patches and will create PRs just for 
reference.
   
   

