alamb commented on code in PR #6870:
URL: https://github.com/apache/arrow-rs/pull/6870#discussion_r1882219998
##########
parquet/src/column/writer/mod.rs:
##########
@@ -1444,15 +1444,31 @@ fn increment(mut data: Vec<u8>) -> Option<Vec<u8>> {
/// Try and increment the the string's bytes from right to left, returning when the result
/// is a valid UTF8 string. Returns `None` when it can't increment any byte.
fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
+ const UTF8_CONTINUATION: u8 = 0x80;
+ const UTF8_CONTINUATION_MASK: u8 = 0xc0;
+
+ let mut len = data.len();
for idx in (0..data.len()).rev() {
let original = data[idx];
let (byte, overflow) = original.overflowing_add(1);
if !overflow {
data[idx] = byte;
if str::from_utf8(&data).is_ok() {
+ if len != data.len() {
+ data.truncate(len);
+ }
return Some(data);
}
- data[idx] = original;
+ // Incrementing "original" did not yield a valid unicode character, so it overflowed
+ // its available bits. If it was a continuation byte (b10xxxxxx) then set to min
+ // continuation (b10000000). Otherwise it was the first byte so set reset the first
+ // byte back to its original value (so data remains a valid string) and reduce "len".
+ if original & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION {
Review Comment:
> If this isn't super perf critical, can we switch over to operating on
> codepoints? (i assume this is for stats only, so not a hot path?)
I agree switching to arithmetic on codepoints would be easier to reason
about.
I double-checked, and this function is only called while writing stats (at
most once per page and once per column chunk):
https://github.com/search?q=repo%3Aapache%2Farrow-rs%20increment_utf8&type=code
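For illustration, here is a minimal sketch of what arithmetic on code points could look like. It is not the code from this PR: the name `increment_utf8_by_codepoint`, the `&str` input, and the `String` return type are all assumptions made for the example. It also assumes the result must never be longer than the input, so it skips any increment that would need more bytes to encode.

```rust
/// Hypothetical helper (not the PR's implementation): returns a string that
/// sorts strictly after `data` by incrementing the right-most code point that
/// can be incremented, dropping everything to its right.
/// Returns `None` when no code point can be incremented.
fn increment_utf8_by_codepoint(data: &str) -> Option<String> {
    // Walk the characters from right to left, tracking each byte offset.
    for (idx, original) in data.char_indices().rev() {
        // `char::from_u32` rejects surrogates (U+D800..=U+DFFF) and values
        // above U+10FFFF, so an invalid successor just moves us one
        // character to the left.
        let Some(next) = char::from_u32(original as u32 + 1) else {
            continue;
        };
        // Assumption: the result must not grow, so skip increments that
        // change the encoded width (for example U+007F -> U+0080).
        if next.len_utf8() != original.len_utf8() {
            continue;
        }
        // Keep the prefix before this character and append its successor.
        let mut result = String::with_capacity(idx + next.len_utf8());
        result.push_str(&data[..idx]);
        result.push(next);
        return Some(result);
    }
    None
}
```

Under these assumptions, `increment_utf8_by_codepoint("hello")` yields `Some("hellp")`, while a string consisting only of U+10FFFF yields `None`. Because every step works on whole `char` values, the result is valid UTF-8 by construction and no `str::from_utf8` re-validation is needed.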