[GitHub] [lucene] gsmiller commented on issue #12451: Interesting TestStringsToAutomaton failure

via GitHub Thu, 20 Jul 2023 18:17:32 -0700


gsmiller commented on issue #12451:
URL: https://github.com/apache/lucene/issues/12451#issuecomment-1644858021

Hmm... I'm not sure yet if this is a new bug or if the tests added by the PR
this references just uncovered it. But if I add U+FFFF and U+10000 (code
points 65535 and 65536) as two single-character terms in an automaton, here's
what gets built:

![out](https://github.com/apache/lucene/assets/16479560/265e681c-794c-4677-a88d-923c3e79d594)
It's representing the two terms as a min/max range on a single transition.

I wonder if the bug is in the automaton representation itself, where our min
in this case is the largest value that can be represented in utf8 in 3 bytes,
and our max is the smallest value that can be represented in 4 utf8 bytes. In
this case:
* min: U+FFFF (65535); `11111111 11111111` in binary; `11101111 10111111 10111111` in utf8
* max: U+10000 (65536); `00000001 00000000 00000000` in binary; `11110000 10010000 10000000 10000000` in utf8
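
For anyone who wants to double-check the byte patterns above, here is a small standalone sketch that uses only the JDK (no Lucene classes). The class name `Utf8Bits` and the helper `toBits` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bits {
    // Encode one Unicode code point as UTF-8 and render each byte as 8 binary digits.
    static String toBits(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to an unsigned value, then left-pad to a full 8 bits.
            String bits = Integer.toBinaryString(b & 0xFF);
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits(0xFFFF));  // 11101111 10111111 10111111
        System.out.println(toBits(0x10000)); // 11110000 10010000 10000000 10000000
    }
}
```

The two printed lines match the 3-byte and 4-byte encodings listed above, so the min and max really do sit on either side of the 3-byte/4-byte boundary.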

The value that is getting erroneously accepted by the automaton is the 4
byte representation of all 0s in utf8. I wonder if the range is incorrectly
including that value in the edge case where the min/max transition range spans
a transition in utf8 byte counts? This is just a guess based on the values that
reproduce the bug. I've run out of time for now to dig into it, but will peek
again this weekend if I have time or early next week if not.
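
To make the guess above concrete: a code point range that crosses a UTF-8 length boundary cannot be treated as one contiguous byte range, so it has to be split at the boundary first. The sketch below only illustrates that idea. It is not Lucene's conversion code, and `Utf8RangeSplit`, `utf8Length`, and `splitByLength` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class Utf8RangeSplit {
    // Largest code point encodable in 1, 2, 3 and 4 UTF-8 bytes respectively.
    static final int[] MAX_BY_LENGTH = {0x7F, 0x7FF, 0xFFFF, 0x10FFFF};

    // Number of UTF-8 bytes needed for a code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        int length = 1;
        while (codePoint > MAX_BY_LENGTH[length - 1]) {
            length++;
        }
        return length;
    }

    // Split [min, max] into {lo, hi} sub-ranges whose members all share one UTF-8 length.
    static List<int[]> splitByLength(int min, int max) {
        List<int[]> ranges = new ArrayList<>();
        int lo = min;
        while (lo <= max) {
            int hi = Math.min(max, MAX_BY_LENGTH[utf8Length(lo) - 1]);
            ranges.add(new int[] {lo, hi});
            lo = hi + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : splitByLength(0xFFFF, 0x10000)) {
            System.out.printf("U+%04X..U+%04X (%d bytes)%n", r[0], r[1], utf8Length(r[0]));
        }
    }
}
```

For the range in this issue, the split yields two one-element ranges: U+FFFF alone at 3 bytes and U+10000 alone at 4 bytes. If the real conversion mishandles that split, extra 4-byte sequences could be accepted, but that remains a hypothesis until someone digs into the code.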

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
