rmuir commented on issue #11976:
URL: https://github.com/apache/lucene/issues/11976#issuecomment-1328150137

   I debugged the issue. The problem is not specific to this particular
   charfilter; it affects all charfilters.
   
   Consider this single-character string: "㋀"
   Our charfilter turns it into two characters: "1" and "月"
   Both tokens come from the same source character, so we would expect the offsets to look like this:
   ```
   first token "1" at rawStartOffset=0, rawEndOffset=1 -> startOffset=0, endOffset=1
     correctOffset(0) -> 0
     correctOffset(1) -> 1
   second token "月" at rawStartOffset=1, rawEndOffset=2 -> startOffset=0, endOffset=1
     correctOffset(1) -> 0
     correctOffset(2) -> 1
   ```
   
   As you can see, the bug is in the charfilter `correctOffset` API itself: a
   single method cannot serve both cases. We need `correctOffset(1) -> 1` for
   the end offset of the first token, but `correctOffset(1) -> 0` for the
   start offset of the second token.
   
   I can't see any way to fix this without changing the charfilter API itself
   (e.g. supporting two separate methods: `correctStartOffset()` and
   `correctEndOffset()`).
   
   Sorry for the bad example/explanation. Another example would be a
   charfilter that converts `æ` to `ae`. The end offset of `a` (1) needs to
   remain 1 after correction, but the start offset of `e` (also 1) needs to
   be corrected to 0.
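
To make the conflict concrete, here is a minimal sketch of what separate start and end corrections could compute for the `æ` -> `ae` case. It is not Lucene code: the class `SplitOffsetSketch` and its `record` method are hypothetical names made up for illustration, and it assumes every input character is exactly one char wide and expands to one or more output chars.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Lucene. Assumes every input character is
// one char wide and expands to one or more output chars.
public class SplitOffsetSketch {
    // sourceIndex.get(i) = index of the input char that produced output char i.
    private final List<Integer> sourceIndex = new ArrayList<>();

    // Record that the input char at inputIndex expanded to outputLength output chars.
    public void record(int inputIndex, int outputLength) {
        for (int i = 0; i < outputLength; i++) {
            sourceIndex.add(inputIndex);
        }
    }

    // A token starting at output offset `off` starts where its source char starts.
    public int correctStartOffset(int off) {
        if (off >= sourceIndex.size()) {
            // Past the end of the output: map to the end of the input seen so far.
            return sourceIndex.isEmpty() ? 0 : sourceIndex.get(sourceIndex.size() - 1) + 1;
        }
        return sourceIndex.get(off);
    }

    // A token ending (exclusive) at output offset `off` ends where the source
    // char of its last output char ends.
    public int correctEndOffset(int off) {
        if (off <= 0) {
            return 0;
        }
        int last = Math.min(off, sourceIndex.size()) - 1;
        return last < 0 ? 0 : sourceIndex.get(last) + 1;
    }

    public static void main(String[] args) {
        SplitOffsetSketch s = new SplitOffsetSketch();
        s.record(0, 2); // "æ" (input index 0) -> "ae" (two output chars)
        System.out.println("a: " + s.correctStartOffset(0) + "-" + s.correctEndOffset(1));
        System.out.println("e: " + s.correctStartOffset(1) + "-" + s.correctEndOffset(2));
    }
}
```

Both tokens come back as 0-1 here. The same output offset 1 yields 1 when treated as an end offset and 0 when treated as a start offset, which is exactly what a single `correctOffset(int)` cannot express.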
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


