[GitHub] [lucene] mikemccand commented on issue #12458: UTF32toUTF8 can create automata that produce/accept invalid unicode

via GitHub Thu, 27 Jul 2023 06:44:30 -0700


mikemccand commented on issue #12458:
URL: https://github.com/apache/lucene/issues/12458#issuecomment-1653658734


   The code does in fact seem to try to handle this case, where the start and 
end code points encode to different numbers of UTF-8 bytes, in the final `else` 
clause of the confusing `build` method:
   
   ```
       } else {
   
         // start
         start(start, end, startUTF8, upto, true);
   
         // possibly middle, spanning multiple num bytes
         int byteCount = 1 + startUTF8.len - upto;
         final int limit = endUTF8.len - upto;
         while (byteCount < limit) {
           // wasteful: we only need first byte, and, we should
           // statically encode this first byte:
           tmpUTF8a.set(startCodes[byteCount - 1]);
           tmpUTF8b.set(endCodes[byteCount - 1]);
           all(start, end, tmpUTF8a.byteAt(0), tmpUTF8b.byteAt(0), tmpUTF8a.len - 1);
           byteCount++;
         }
   ```
   
   Here the bug must lurk!
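
To make the "spanning multiple num bytes" case concrete, here is a minimal standalone sketch of splitting a code point range at the standard UTF-8 length boundaries (U+007F, U+07FF, U+FFFF, U+10FFFF). It is not Lucene's code: the class `Utf8RangeSplit` and its methods `utf8Length` and `splitByLength` are hypothetical names used only for illustration, and the sketch assumes that the `startCodes`/`endCodes` tables in the snippet above correspond to these same boundaries. It also makes no attempt to treat the surrogate range U+D800..U+DFFF specially.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
public class Utf8RangeSplit {

  // Largest code point that fits in 1, 2, 3 and 4 UTF-8 bytes respectively.
  private static final int[] MAX_CODE_POINT = {0x7F, 0x7FF, 0xFFFF, 0x10FFFF};

  // Number of bytes UTF-8 needs for the given code point.
  static int utf8Length(int codePoint) {
    if (codePoint < 0) {
      throw new IllegalArgumentException("negative code point: " + codePoint);
    }
    for (int i = 0; i < MAX_CODE_POINT.length; i++) {
      if (codePoint <= MAX_CODE_POINT[i]) {
        return i + 1;
      }
    }
    throw new IllegalArgumentException("code point out of range: " + codePoint);
  }

  // Splits the inclusive range [start, end] into sub-ranges whose code points
  // all share one UTF-8 byte length. Each element is {lo, hi}, inclusive.
  static List<int[]> splitByLength(int start, int end) {
    List<int[]> ranges = new ArrayList<>();
    int lo = start;
    while (lo <= end) {
      // Clip at the last code point that has the same byte length as lo.
      int hi = Math.min(end, MAX_CODE_POINT[utf8Length(lo) - 1]);
      ranges.add(new int[] {lo, hi});
      lo = hi + 1;
    }
    return ranges;
  }

  public static void main(String[] args) {
    // A range that starts in the 1-byte region and ends in the 4-byte region.
    for (int[] r : splitByLength(0x61, 0x10000)) {
      System.out.printf("U+%04X..U+%04X (%d byte(s))%n", r[0], r[1], utf8Length(r[0]));
    }
  }
}
```

For the range U+0061..U+10000 this yields four pieces: a partial 1-byte range at the start, the full 2-byte and 3-byte ranges in the middle, and a partial 4-byte range at the end. The full middle ranges are the part the `while` loop in the snippet above appears to cover.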
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


