etseidl commented on code in PR #6870:
URL: https://github.com/apache/arrow-rs/pull/6870#discussion_r1885125823
##########
parquet/src/column/writer/mod.rs:
##########
@@ -1418,13 +1438,51 @@ fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
     (a[1..]) > (b[1..])
 }
-/// Truncate a UTF8 slice to the longest prefix that is still a valid UTF8 string,
-/// while being less than `length` bytes and non-empty
+/// Truncate a UTF-8 slice to the longest prefix that is still a valid UTF-8 string,
+/// while being less than `length` bytes and non-empty. Returns `None` if truncation
+/// is not possible within those constraints.
+///
+/// The caller guarantees that data.len() > length.
 fn truncate_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
     let split = (1..=length).rfind(|x| data.is_char_boundary(*x))?;
     Some(data.as_bytes()[..split].to_vec())
 }
+/// Truncate a UTF-8 slice and increment its final character. The returned value is the
+/// longest such slice that is still a valid UTF-8 string while being less than `length`
+/// bytes and non-empty. Returns `None` if no such transformation is possible.
+///
+/// The caller guarantees that data.len() > length.
+fn truncate_and_increment_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
+    // UTF-8 is max 4 bytes, so start search 3 back from desired length
+    let lower_bound = length.saturating_sub(3);
+    let split = (lower_bound..=length).rfind(|x| data.is_char_boundary(*x))?;
+    increment_utf8(data.get(..split)?)
+}
+
+/// Increment the final character in a UTF-8 string in such a way that the returned result
+/// is still a valid UTF-8 string. The returned string may be shorter than the input if the
+/// last character(s) cannot be incremented (due to overflow or producing invalid code points).
+/// Returns `None` if the string cannot be incremented.
+///
+/// Note that this implementation will not promote an N-byte code point to (N+1) bytes.
+fn increment_utf8(data: &str) -> Option<Vec<u8>> {
+    for (idx, code_point) in data.char_indices().rev() {
+        let curr_len = code_point.len_utf8();
+        let original = code_point as u32;
+        if let Some(next_char) = char::from_u32(original + 1) {
+            // do not allow increasing byte width of incremented char
Review Comment:
Yes, there's no way incrementing a valid character will overflow a u32, so we
can assume its byte width only stays the same or grows. I suppose we could
change the test to something like `if idx + next_char.len_utf8() <= data.len()`.
That way, if we've already removed some characters and created space under the
truncation limit, we can afford the incremented character growing by a byte. If
we want to take that tack, we should probably pass in the truncation length,
since we may have already gone under the limit by not splitting a character.
I think that's maybe being too fussy and would prefer to keep this simple.
Thoughts?
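To make that alternative concrete, here is a rough sketch of the direction described above. The name `increment_utf8_with_limit` and its `length` parameter are illustrative only and not part of this PR; the idea is simply to let an incremented character widen by a byte when earlier truncation has left room under the limit:

```rust
/// Sketch only: a variant of `increment_utf8` that is handed the truncation limit,
/// so an incremented character may grow by a byte as long as the result still fits.
fn increment_utf8_with_limit(data: &str, length: usize) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        if let Some(next_char) = char::from_u32(code_point as u32 + 1) {
            // Drop this character and everything after it, then append the
            // incremented character, provided the total stays within `length` bytes.
            if idx + next_char.len_utf8() <= length {
                let mut result = data.as_bytes()[..idx].to_vec();
                let mut buf = [0u8; 4];
                result.extend_from_slice(next_char.encode_utf8(&mut buf).as_bytes());
                return Some(result);
            }
        }
    }
    None
}
```

With that shape, `truncate_and_increment_utf8` would pass its `length` straight through instead of comparing byte widths, which is exactly the extra plumbing I'd rather avoid for now.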
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]