etseidl commented on code in PR #6870:
URL: https://github.com/apache/arrow-rs/pull/6870#discussion_r1883509286
##########
parquet/src/column/writer/mod.rs:
##########
@@ -1444,15 +1444,31 @@ fn increment(mut data: Vec<u8>) -> Option<Vec<u8>> {
 /// Try and increment the string's bytes from right to left, returning when the result
 /// is a valid UTF8 string. Returns `None` when it can't increment any byte.
 fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
+    const UTF8_CONTINUATION: u8 = 0x80;
+    const UTF8_CONTINUATION_MASK: u8 = 0xc0;
+
+    let mut len = data.len();
     for idx in (0..data.len()).rev() {
         let original = data[idx];
         let (byte, overflow) = original.overflowing_add(1);
         if !overflow {
             data[idx] = byte;
             if str::from_utf8(&data).is_ok() {
+                if len != data.len() {
+                    data.truncate(len);
+                }
                 return Some(data);
             }
-            data[idx] = original;
+            // Incrementing "original" did not yield a valid unicode character, so it overflowed
+            // its available bits. If it was a continuation byte (b10xxxxxx) then set to min
+            // continuation (b10000000). Otherwise it was the first byte, so reset the first
+            // byte back to its original value (so data remains a valid string) and reduce "len".
+            if original & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION {
Review Comment:
Ok, here's what I came up with (stealing liberally from the code in
https://github.com/apache/datafusion/pull/12978 😄) that avoids as much
translation/allocation as I could manage.
```rust
/// caller guarantees that data.len() > length
fn truncate_and_inc_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
    // UTF-8 is max 4 bytes, so start search 3 back from desired length
    let lower_bound = length.saturating_sub(3);
    let split = (lower_bound..=length).rfind(|x| data.is_char_boundary(*x))?;
    increment_utf8_str(data.get(..split)?)
}

fn increment_utf8_str(data: &str) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        let curr_len = code_point.len_utf8();
        let original = code_point as u32;
        if let Some(next_char) = char::from_u32(original + 1) {
            // do not allow increasing byte width of incremented char
            if next_char.len_utf8() == curr_len {
                let mut result = data.as_bytes()[..idx + curr_len].to_vec();
                next_char.encode_utf8(&mut result[idx..]);
                return Some(result);
            }
        }
    }
    None
}
```
This winds up being a little faster than the byte-by-byte implementation in
this PR even in the latter's best case, and as much as 30X faster in the worst
case.
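For a quick sanity check (these inputs are just ones I made up, not from the PR's test suite), the truncate-then-increment path behaves like this, assuming the two functions above are in scope:
```rust
fn main() {
    // ASCII: truncate "hellp" (5 bytes) to 4 and bump the last char of "hell".
    assert_eq!(truncate_and_inc_utf8("hellp", 4), Some(b"helm".to_vec()));

    // Byte offset 2 lands inside the 2-byte 'é', so the split backs up to the
    // char boundary at 1 and the remaining "h" is incremented to "i".
    assert_eq!(truncate_and_inc_utf8("hé!", 2), Some(b"i".to_vec()));

    // U+07FF + 1 would need 3 bytes instead of 2, so that char is skipped and
    // the preceding 'a' is bumped instead, dropping the trailing char.
    assert_eq!(increment_utf8_str("a\u{7FF}"), Some(b"b".to_vec()));
}
```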
The check for increasing byte count can be skipped if we don't care about
overshooting a little. This also doesn't include the check for non-characters,
since all we care about is producing a valid UTF-8 bound; the bounds shouldn't
be interpreted by readers anyway, as they are not exact. The only case this
doesn't handle is incrementing U+D7FF. The range U+D800..=U+DFFF is reserved
for UTF-16 surrogates, so those code points are not valid characters. The
byte-by-byte implementation will step past them to the next valid code point,
U+E000, whereas this code will punt and say U+D7FF cannot be incremented. This
isn't really a practical concern, though, since the nearest assigned code
point is U+D7FB.
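To make that last point concrete, here's the one disagreement between the two approaches (again assuming the functions above; a lone U+D7FF is a contrived input you'd basically never see):
```rust
fn main() {
    // U+D7FF + 1 = U+D800 is a UTF-16 surrogate, which char::from_u32 rejects,
    // and there is no earlier char to fall back to, so this version gives up
    // rather than jumping ahead to U+E000 like the byte-by-byte loop does.
    assert_eq!(increment_utf8_str("\u{D7FF}"), None);

    // With any preceding char the fallback still kicks in: 'a' gets bumped.
    assert_eq!(increment_utf8_str("a\u{D7FF}"), Some(b"b".to_vec()));
}
```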
@alamb @findepi if this looks reasonable I'll clean it up and push it.