etseidl commented on code in PR #6870:
URL: https://github.com/apache/arrow-rs/pull/6870#discussion_r1882559110
##########
parquet/src/column/writer/mod.rs:
##########
@@ -1444,15 +1444,31 @@ fn increment(mut data: Vec<u8>) -> Option<Vec<u8>> {
/// Try and increment the string's bytes from right to left, returning when the result
/// is a valid UTF8 string. Returns `None` when it can't increment any byte.
fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
+ const UTF8_CONTINUATION: u8 = 0x80;
+ const UTF8_CONTINUATION_MASK: u8 = 0xc0;
+
+ let mut len = data.len();
for idx in (0..data.len()).rev() {
let original = data[idx];
let (byte, overflow) = original.overflowing_add(1);
if !overflow {
data[idx] = byte;
if str::from_utf8(&data).is_ok() {
+ if len != data.len() {
+ data.truncate(len);
+ }
return Some(data);
}
- data[idx] = original;
+ // Incrementing "original" did not yield a valid unicode character, so it overflowed
+ // its available bits. If it was a continuation byte (b10xxxxxx) then set to min
+ // continuation (b10000000). Otherwise it was the first byte, so reset the first
+ // byte back to its original value (so data remains a valid string) and reduce "len".
+ if original & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION {
Review Comment:
> Trying to increment the last byte seems strange to me (as the byte may be in
the middle of a utf8 multi-byte sequence). I don't fully understand the
implications of setting the last byte to utf8 continuation token. Is that even
valid UTF8 🤔
Yes. First off, we've already truncated to a UTF8 boundary in
`truncate_utf8`, so that's not a concern.
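For illustration, a minimal sketch of truncating back to a UTF-8 char boundary (a hypothetical standalone function, not necessarily how the PR's `truncate_utf8` is implemented):

```rust
// Hypothetical sketch: cut a string at or before `max_len` bytes without
// splitting a multi-byte UTF-8 sequence.
fn truncate_to_boundary(s: &str, max_len: usize) -> &str {
    let mut end = max_len.min(s.len());
    // Walk left until we land on a char boundary.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    // 'Ā' (U+0100) is two bytes (0xC4 0x80); cutting at byte 2 would split it.
    let s = "a\u{0100}b";
    assert_eq!(truncate_to_boundary(s, 2), "a");
    assert_eq!(truncate_to_boundary(s, 3), "a\u{0100}");
}
```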
As far as overflow, a continuation byte in UTF8 is `b10xxxxxx`...a six bit
integer. The largest continuation byte is `b10111111`. Incrementing that as a
`u8` produces `b11000000` which is no longer a valid continuation byte. So we
need to behave as if it's overflowed and set to a `0` continuation byte, which
is `b10000000` and then try to increment the next higher byte. Using
`UTF8_CONTINUATION` is clearly valid as the character 'Ā' (U+0100) has the
UTF-8 encoded value `0xC4 0x80`.
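A quick sketch of that arithmetic (constants mirror the ones in the diff):

```rust
const UTF8_CONTINUATION: u8 = 0x80; // b1000_0000, the minimum continuation byte
const UTF8_CONTINUATION_MASK: u8 = 0xc0; // top two bits select the byte class

fn main() {
    // Incrementing the largest continuation byte (0b1011_1111) as a u8
    // produces 0b1100_0000, which is no longer in the continuation class.
    let max_continuation: u8 = 0b1011_1111;
    let (bumped, _) = max_continuation.overflowing_add(1);
    assert_ne!(bumped & UTF8_CONTINUATION_MASK, UTF8_CONTINUATION);

    // Resetting it to 0x80 and carrying into the lead byte yields 'Ā'
    // (U+0100), whose UTF-8 encoding is exactly [0xC4, 0x80].
    let bytes = [0xC4u8, UTF8_CONTINUATION];
    assert_eq!(std::str::from_utf8(&bytes).unwrap(), "\u{0100}");
}
```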
If incrementing the first byte overflows, we punt on that code point and
move on to the next char, starting the process over.
> That would also save a call to from_utf8 so it would also likely be faster
too
True, but the conversion from bytes to chars and then back to bytes is not
free either. Let me try both approaches and see if there's a meaningful
difference.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]