gsmiller commented on issue #12451: URL: https://github.com/apache/lucene/issues/12451#issuecomment-1644858021
Hmm... I'm not sure yet whether this is a new bug or whether the tests added by the PR this references just uncovered it. But if I add U+FFFF (65535) and U+10000 (65536) as two single-character terms in an automaton, the automaton that gets built represents the two terms as a min/max range on a single transition.

I wonder if the bug is in the automaton representation itself, where our min in this case is the largest value that can be represented in UTF-8 in 3 bytes, and our max is the smallest value that needs 4 UTF-8 bytes. In this case:

* min: U+FFFF (65535); `11111111 11111111` in binary; `11101111 10111111 10111111` in UTF-8
* max: U+10000 (65536); `00000001 00000000 00000000` in binary; `11110000 10010000 10000000 10000000` in UTF-8

The value that is getting erroneously accepted by the automaton is the 4 byte representation of all 0s in UTF-8. I wonder if the range is incorrectly including that value in the edge case where the min/max transition range crosses a boundary in UTF-8 byte length? This is just a guess based on the values that reproduce the bug.

I've run out of time for now to dig into it, but will peek again this weekend if I have time, or early next week if not.
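To make the boundary concrete, here is a minimal Python sketch (not Lucene code) that reproduces the two encodings quoted above and shows how a code point range could be split wherever the UTF-8 byte length changes. The helper names `utf8_bits`, `utf8_len`, and `split_by_utf8_length` are hypothetical, chosen for illustration; the sketch ignores the surrogate gap and makes no claim about how Lucene converts ranges internally.

```python
# Largest code point encodable in 1, 2, 3 and 4 UTF-8 bytes respectively.
UTF8_MAX_BY_LENGTH = (0x7F, 0x7FF, 0xFFFF, 0x10FFFF)


def utf8_bits(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as space-separated bit groups."""
    encoded = chr(code_point).encode("utf-8")
    return " ".join(f"{byte:08b}" for byte in encoded)


def utf8_len(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for this code point."""
    for length, limit in enumerate(UTF8_MAX_BY_LENGTH, start=1):
        if code_point <= limit:
            return length
    raise ValueError(f"not a Unicode code point: {code_point:#x}")


def split_by_utf8_length(lo: int, hi: int) -> list[tuple[int, int]]:
    """Split [lo, hi] into sub-ranges that each use a single UTF-8 byte length.

    Hypothetical helper: surrogates (U+D800..U+DFFF) are not handled.
    """
    pieces = []
    start = lo
    for limit in UTF8_MAX_BY_LENGTH:
        if start > hi:
            break
        if start <= limit:
            end = min(hi, limit)
            pieces.append((start, end))
            start = end + 1
    return pieces


if __name__ == "__main__":
    for cp in (0xFFFF, 0x10000):
        print(f"U+{cp:04X} ({cp}): {utf8_len(cp)} bytes: {utf8_bits(cp)}")
    # The single min/max transition from the comment covers two byte lengths.
    for lo, hi in split_by_utf8_length(0xFFFF, 0x10000):
        print(f"U+{lo:04X}..U+{hi:04X} uses {utf8_len(lo)} bytes")
```

Under these assumptions, the range U+FFFF..U+10000 splits into two single-value pieces, one 3 bytes long and one 4 bytes long, which is the situation the guess above is about.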