etseidl commented on code in PR #6870:
URL: https://github.com/apache/arrow-rs/pull/6870#discussion_r1882559110
##########
parquet/src/column/writer/mod.rs:
##########
@@ -1444,15 +1444,31 @@ fn increment(mut data: Vec<u8>) -> Option<Vec<u8>> {
/// Try and increment the string's bytes from right to left, returning when the result
/// is a valid UTF8 string. Returns `None` when it can't increment any byte.
fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
+ const UTF8_CONTINUATION: u8 = 0x80;
+ const UTF8_CONTINUATION_MASK: u8 = 0xc0;
+
+ let mut len = data.len();
for idx in (0..data.len()).rev() {
let original = data[idx];
let (byte, overflow) = original.overflowing_add(1);
if !overflow {
data[idx] = byte;
if str::from_utf8(&data).is_ok() {
+ if len != data.len() {
+ data.truncate(len);
+ }
return Some(data);
}
- data[idx] = original;
+ // Incrementing "original" did not yield a valid unicode character, so it overflowed
+ // its available bits. If it was a continuation byte (b10xxxxxx) then set to min
+ // continuation (b10000000). Otherwise it was the first byte, so reset the first
+ // byte back to its original value (so data remains a valid string) and reduce "len".
+ if original & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION {
Review Comment:
> Trying to increment the last byte seems strange to me (as the byte may be in
the middle of a utf8 multi-byte sequence). I don't fully understand the
implications of setting the last byte to utf8 continuation token. Is that even
valid UTF8 🤔
Yes. First off, we've already truncated to a UTF8 boundary in
`truncate_utf8`, so that's not a concern.
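For illustration, a minimal sketch of truncating back to a UTF-8 char boundary (a hypothetical standalone function, not necessarily how the PR's `truncate_utf8` is implemented):

```rust
// Hypothetical sketch: cut a string at or before `max_len` bytes without
// splitting a multi-byte UTF-8 sequence.
fn truncate_to_boundary(s: &str, max_len: usize) -> &str {
    let mut end = max_len.min(s.len());
    // Walk left until we land on a char boundary.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    // 'Ā' (U+0100) is two bytes (0xC4 0x80); cutting at byte 2 would split it.
    let s = "a\u{0100}b";
    assert_eq!(truncate_to_boundary(s, 2), "a");
    assert_eq!(truncate_to_boundary(s, 3), "a\u{0100}");
}
```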
As far as overflow, a continuation byte in UTF8 is `b10xxxxxx`...a six bit
integer. The largest continuation byte is `b10111111`. Incrementing that as a
`u8` produces `b11000000` which is no longer a valid continuation byte. So we
need to behave as if it's overflowed and set to a `0` continuation byte, which
is `b10000000` and then try to increment the next higher byte. Using
`UTF8_CONTINUATION` is clearly valid as the character 'Ā' (U+0100) has the
UTF-8 encoded value `0xC4 0x80`.
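A quick sketch of that arithmetic (constants mirror the ones in the diff):

```rust
const UTF8_CONTINUATION: u8 = 0x80; // b1000_0000, the minimum continuation byte
const UTF8_CONTINUATION_MASK: u8 = 0xc0; // top two bits select the byte class

fn main() {
    // Incrementing the largest continuation byte (0b1011_1111) as a u8
    // produces 0b1100_0000, which is no longer in the continuation class.
    let max_continuation: u8 = 0b1011_1111;
    let (bumped, _) = max_continuation.overflowing_add(1);
    assert_ne!(bumped & UTF8_CONTINUATION_MASK, UTF8_CONTINUATION);

    // Resetting it to 0x80 and carrying into the lead byte yields 'Ā'
    // (U+0100), whose UTF-8 encoding is exactly [0xC4, 0x80].
    let bytes = [0xC4u8, UTF8_CONTINUATION];
    assert_eq!(std::str::from_utf8(&bytes).unwrap(), "\u{0100}");
}
```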
If incrementing the first byte overflows, we punt on that code point and
move on to the next char, starting the process over.
> That would also save a call to from_utf8 so it would also likely be faster
too
True, but the conversion from bytes to chars and then back to bytes is not
free either. Let me try both approaches and see if there's a meaningful
difference.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]