Re: [PR] Fix some edge cases in UTF-8 incrementing [arrow-rs]

via GitHub Thu, 12 Dec 2024 06:15:50 -0800


alamb commented on code in PR #6870:
URL: https://github.com/apache/arrow-rs/pull/6870#discussion_r1882219998



##########
parquet/src/column/writer/mod.rs:
##########
@@ -1444,15 +1444,31 @@ fn increment(mut data: Vec<u8>) -> Option<Vec<u8>> {
 /// Try and increment the the string's bytes from right to left, returning when the result
 /// is a valid UTF8 string. Returns `None` when it can't increment any byte.
 fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
+    const UTF8_CONTINUATION: u8 = 0x80;
+    const UTF8_CONTINUATION_MASK: u8 = 0xc0;
+
+    let mut len = data.len();
     for idx in (0..data.len()).rev() {
         let original = data[idx];
         let (byte, overflow) = original.overflowing_add(1);
         if !overflow {
             data[idx] = byte;
             if str::from_utf8(&data).is_ok() {
+                if len != data.len() {
+                    data.truncate(len);
+                }
                 return Some(data);
             }
-            data[idx] = original;
+            // Incrementing "original" did not yield a valid unicode character, so it overflowed
+            // its available bits. If it was a continuation byte (b10xxxxxx) then set to min
+            // continuation (b10000000). Otherwise it was the first byte so set reset the first
+            // byte back to its original value (so data remains a valid string) and reduce "len".
+            if original & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION {

Review Comment:
   > If this isn't super perf critical, can we switch over to operating on codepoints? (i assume this is for stats only, so not a hot path?)
   
   I agree that switching to arithmetic on codepoints would be easier to reason about.
   
   I double-checked, and this function is only called while writing stats (at most once per page and once per column chunk):
   
https://github.com/search?q=repo%3Aapache%2Farrow-rs%20increment_utf8&type=code
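
For illustration, a codepoint-based increment might look roughly like the sketch below. This is not the code in the PR: `next_char` and `increment_utf8_by_char` are hypothetical names, and the sketch assumes the input is already valid UTF-8 and does not try to keep the result within the original byte length (incrementing U+007F to U+0080, for example, grows the encoding from one byte to two), which a real implementation for truncated statistics would probably need to handle.

```rust
/// Hypothetical helper: the next Unicode scalar value after `c`, if any.
/// Skips the surrogate gap (U+D800..=U+DFFF), which is not valid in UTF-8.
fn next_char(c: char) -> Option<char> {
    match c {
        char::MAX => None,              // U+10FFFF has no successor
        '\u{D7FF}' => Some('\u{E000}'), // jump over the surrogate range
        _ => char::from_u32(c as u32 + 1),
    }
}

/// Hypothetical codepoint-based variant of `increment_utf8`.
/// Works from the last character towards the first: a character that cannot
/// be incremented is dropped, and the one before it is tried instead.
/// Returns `None` when no character can be incremented.
fn increment_utf8_by_char(data: &str) -> Option<String> {
    let mut chars: Vec<char> = data.chars().collect();
    while let Some(last) = chars.pop() {
        if let Some(next) = next_char(last) {
            chars.push(next);
            return Some(chars.into_iter().collect());
        }
        // `last` was char::MAX: leave it dropped and try the previous char.
    }
    None
}
```

Because UTF-8 byte order matches codepoint order, the returned string compares greater than the input bytewise, which is the property an upper-bound statistic needs.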
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
