etseidl commented on code in PR #6870:
URL: https://github.com/apache/arrow-rs/pull/6870#discussion_r1883509286
##########
parquet/src/column/writer/mod.rs:
##########
@@ -1444,15 +1444,31 @@ fn increment(mut data: Vec<u8>) -> Option<Vec<u8>> {
 /// Try and increment the string's bytes from right to left, returning when the result
 /// is a valid UTF8 string. Returns `None` when it can't increment any byte.
 fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
+    const UTF8_CONTINUATION: u8 = 0x80;
+    const UTF8_CONTINUATION_MASK: u8 = 0xc0;
+
+    let mut len = data.len();
     for idx in (0..data.len()).rev() {
         let original = data[idx];
         let (byte, overflow) = original.overflowing_add(1);
         if !overflow {
             data[idx] = byte;
             if str::from_utf8(&data).is_ok() {
+                if len != data.len() {
+                    data.truncate(len);
+                }
                 return Some(data);
             }
-            data[idx] = original;
+            // Incrementing "original" did not yield a valid unicode character, so it overflowed
+            // its available bits. If it was a continuation byte (b10xxxxxx) then set to min
+            // continuation (b10000000). Otherwise it was the first byte, so reset the first
+            // byte back to its original value (so data remains a valid string) and reduce "len".
+            if original & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION {
Review Comment:
Ok, here's what I came up with (stealing liberally from the code in
https://github.com/apache/datafusion/pull/12978 😄) that avoids as much
translation/allocation as I could manage.
```rust
/// caller guarantees that data.len() > length
fn truncate_and_inc_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
    // UTF-8 is max 4 bytes, so start search 3 back from desired length
    let lower_bound = length.saturating_sub(3);
    let split = (lower_bound..=length).rfind(|x| data.is_char_boundary(*x))?;
    increment_utf8_str(data.get(..split)?)
}

fn increment_utf8_str(data: &str) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        let curr_len = code_point.len_utf8();
        let original = code_point as u32;
        if let Some(next_char) = char::from_u32(original + 1) {
            // do not allow increasing byte width of incremented char
            if next_char.len_utf8() == curr_len {
                let mut result = data.as_bytes()[..idx + curr_len].to_vec();
                next_char.encode_utf8(&mut result[idx..]);
                return Some(result);
            }
        }
    }
    None
}
```
This winds up being a little faster than the byte-by-byte implementation in
this PR even in the latter's best case, and as much as 30X faster in the worst
case.
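For a quick sanity check (these inputs are just ones I made up, not from the PR's test suite), the truncate-then-increment path behaves like this, assuming the two functions above are in scope:
```rust
fn main() {
    // ASCII: truncate "hellp" (5 bytes) to 4 and bump the last char of "hell".
    assert_eq!(truncate_and_inc_utf8("hellp", 4), Some(b"helm".to_vec()));

    // Byte offset 2 lands inside the 2-byte 'é', so the split backs up to the
    // char boundary at 1 and the remaining "h" is incremented to "i".
    assert_eq!(truncate_and_inc_utf8("hé!", 2), Some(b"i".to_vec()));

    // U+07FF + 1 would need 3 bytes instead of 2, so that char is skipped and
    // the preceding 'a' is bumped instead, dropping the trailing char.
    assert_eq!(increment_utf8_str("a\u{7FF}"), Some(b"b".to_vec()));
}
```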
The check for increasing byte count can be skipped if we don't care about
overshooting a little. This also doesn't include the check for non-characters,
since all we care about is producing a valid UTF-8 bound; the bounds shouldn't
be interpreted by readers anyway, as they are not exact. The only case this
doesn't handle is incrementing U+D7FF. The range U+D800..=U+DFFF is reserved
for UTF-16 surrogates, so those code points are not valid characters. The
byte-by-byte implementation will step past them to the next valid code point,
U+E000, whereas this code will punt and say U+D7FF cannot be incremented. This
isn't really a practical concern, though, since the nearest assigned code
point is U+D7FB.
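To make that last point concrete, here's the one disagreement between the two approaches (again assuming the functions above; a lone U+D7FF is a contrived input you'd basically never see):
```rust
fn main() {
    // U+D7FF + 1 = U+D800 is a UTF-16 surrogate, which char::from_u32 rejects,
    // and there is no earlier char to fall back to, so this version gives up
    // rather than jumping ahead to U+E000 like the byte-by-byte loop does.
    assert_eq!(increment_utf8_str("\u{D7FF}"), None);

    // With any preceding char the fallback still kicks in: 'a' gets bumped.
    assert_eq!(increment_utf8_str("a\u{D7FF}"), Some(b"b".to_vec()));
}
```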
@alamb @findepi if this looks reasonable I'll clean it up and push it.