hanbj opened a new pull request, #16246:
URL: https://github.com/apache/lucene/pull/16246

   ## Description
   
     `BaseFragmentsBuilder.makeFragment()` throws `StringIndexOutOfBoundsException`
     when token offsets overlap. This is common with CJK analyzers (e.g. CJK bigram,
     ik_max_word) that emit tokens at multiple granularities for the same text span.
   
     For example, "中华人民共和国" produces tokens with these offsets:
     [0,2), [0,4), [1,3), [2,4), [2,7), [4,6), [4,7).
   
     The root cause is that the sequential `srcIndex` cursor assumes tokens are
     non-overlapping. When the previous token's endOffset is greater than the next
     token's startOffset, `substring(srcIndex, toffsStart)` is called with begin > end.
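
The failure mode can be reproduced outside Lucene with a minimal sketch. The class `SequentialCursorSketch`, the method `copyWithCursor`, the `int[]{start, end}` token representation, and the `[`/`]` markers are all hypothetical stand-ins, not the actual `makeFragment()` code; the ASCII string `"ABCDEFG"` stands in for the 7-character CJK text above.

```java
final class SequentialCursorSketch {

    // Simplified cursor loop (an assumption, not Lucene's code): copy the gap
    // before each token, then the token itself wrapped in [ ], and advance the
    // cursor to the token's end offset.
    static String copyWithCursor(String src, int[][] tokens) {
        StringBuilder out = new StringBuilder();
        int srcIndex = 0;
        for (int[] t : tokens) {
            int start = t[0];
            int end = t[1];
            // Throws StringIndexOutOfBoundsException when srcIndex > start,
            // i.e. when this token begins before the previous one ended.
            out.append(src.substring(srcIndex, start));
            out.append('[').append(src.substring(start, end)).append(']');
            srcIndex = end;
        }
        out.append(src.substring(srcIndex));
        return out.toString();
    }

    public static void main(String[] args) {
        // Non-overlapping tokens work as expected.
        System.out.println(copyWithCursor("ABCDEFG", new int[][] {{0, 2}, {2, 4}}));

        // The first two offsets from the example above: [0,2) then [0,4).
        try {
            copyWithCursor("ABCDEFG", new int[][] {{0, 2}, {0, 4}});
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("overlap failed: " + e.getMessage());
        }
    }
}
```

After the first token the cursor sits at 2, so the second token asks for `substring(2, 0)`, which is the begin > end case described above.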
   
     ## Fix
   
     Rather than skipping overlapping tokens entirely, clip them to their
     non-overlapping tail so the full matched region is highlighted:
     - Fully contained tokens → skip
     - Partially overlapping tokens → clip startOffset to srcIndex
     - Invalid offsets (inverted, out-of-bounds) → skip
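
The three rules above can be sketched as follows. This is an illustration under stated assumptions rather than the patch itself: `ClippingSketch`, `highlight`, the `int[]{start, end}` token representation, and the `<b>` tags are hypothetical, and the exact comparisons (`end <= srcIndex` for "fully contained", `start > end` for "inverted") are one plausible reading of the rules. The ASCII string `"ABCDEFG"` again stands in for the 7-character CJK example.

```java
final class ClippingSketch {

    // Highlights tokens in order, tolerating overlapping offsets.
    static String highlight(String src, int[][] tokens) {
        StringBuilder out = new StringBuilder();
        int srcIndex = 0; // end of the text already emitted
        for (int[] t : tokens) {
            int start = t[0];
            int end = t[1];
            // Invalid offsets (inverted or out of bounds): skip.
            if (start < 0 || end > src.length() || start > end) {
                continue;
            }
            // Fully contained in text that was already emitted: skip.
            if (end <= srcIndex) {
                continue;
            }
            // Partially overlapping: clip the start to the cursor so only
            // the non-overlapping tail is highlighted.
            if (start < srcIndex) {
                start = srcIndex;
            }
            out.append(src, srcIndex, start)   // gap before the token
               .append("<b>")
               .append(src, start, end)        // (clipped) token text
               .append("</b>");
            srcIndex = end;
        }
        out.append(src, srcIndex, src.length()); // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        // Offsets from the example in the description.
        int[][] tokens = {{0, 2}, {0, 4}, {1, 3}, {2, 4}, {2, 7}, {4, 6}, {4, 7}};
        // Prints <b>AB</b><b>CD</b><b>EFG</b>
        System.out.println(highlight("ABCDEFG", tokens));
    }
}
```

With the example offsets, [0,4) is clipped to [2,4) and [2,7) to [4,7), while [1,3), [2,4), [4,6), and [4,7) are skipped as fully contained, so the whole matched region ends up highlighted.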
   
     ## Test plan
   
     - [x] Added `TestBaseFragmentsBuilderOverlappingOffsets` with 7 test cases
     - [x] Verified the test reproduces the crash without the fix
     - [x] Verified all tests pass with the fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

