Re: [PR] Improvements to UTF-8 statistics truncation [arrow-rs]

via GitHub Fri, 13 Dec 2024 18:41:13 -0800


etseidl commented on code in PR #6870:
URL: https://github.com/apache/arrow-rs/pull/6870#discussion_r1884741728



##########
parquet/src/column/writer/mod.rs:
##########
@@ -878,24 +878,44 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
         }
     }
 
+    /// Returns `true` if this column's logical type is a UTF-8 string.
+    fn is_utf8(&self) -> bool {
+        self.get_descriptor().logical_type() == Some(LogicalType::String)
+            || self.get_descriptor().converted_type() == ConvertedType::UTF8
+    }
+
     fn truncate_min_value(&self, truncation_length: Option<usize>, data: &[u8]) -> (Vec<u8>, bool) {
         truncation_length
             .filter(|l| data.len() > *l)
-            .and_then(|l| match str::from_utf8(data) {
-                Ok(str_data) => truncate_utf8(str_data, l),
-                Err(_) => Some(data[..l].to_vec()),
-            })
+            .and_then(|l|
+                // don't do extra work if this column isn't UTF-8
+                if self.is_utf8() {
+                    match str::from_utf8(data) {
+                        Ok(str_data) => truncate_utf8(str_data, l),
+                        Err(_) => Some(data[..l].to_vec()),

Review Comment:
   Hmm, good point. The old code simply tried UTF-8 first and then fell back 
to truncating the raw bytes. Here we actually expect valid UTF-8, so perhaps 
it's better to return an error. I'd hope some string validation was done 
before getting this far. I'll think on this some more.
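
For illustration only, here is a rough sketch of what the "return an error" option might look like. It is not the code in this PR: `TruncateError`, `truncate_on_char_boundary`, and `truncate_min_value_strict` are hypothetical names, the `is_utf8` flag is passed in rather than read from the column descriptor, and the "leave the value untruncated" fallback is an assumption about how the caller handles a `None` result.

```rust
use std::str;

/// Hypothetical error type for this sketch; a real writer would
/// presumably use the crate's own error type instead.
#[derive(Debug, PartialEq)]
enum TruncateError {
    InvalidUtf8,
}

/// Stand-in for the PR's `truncate_utf8` (assumed behavior): cut `data`
/// to at most `max_len` bytes without splitting a character.
/// Returns `None` when no non-empty prefix fits.
fn truncate_on_char_boundary(data: &str, max_len: usize) -> Option<Vec<u8>> {
    let split = (1..=max_len).rev().find(|&i| data.is_char_boundary(i))?;
    Some(data.as_bytes()[..split].to_vec())
}

/// Sketch of a min-value truncation that reports invalid UTF-8 as an
/// error instead of silently falling back to a raw byte slice.
/// The boolean in the result says whether the value was truncated.
fn truncate_min_value_strict(
    is_utf8: bool,
    truncation_length: Option<usize>,
    data: &[u8],
) -> Result<(Vec<u8>, bool), TruncateError> {
    // Nothing to do if no limit is set or the value already fits.
    let l = match truncation_length {
        Some(l) if data.len() > l => l,
        _ => return Ok((data.to_vec(), false)),
    };
    // Non-UTF-8 columns skip validation and are cut at the byte level.
    if !is_utf8 {
        return Ok((data[..l].to_vec(), true));
    }
    // UTF-8 columns are expected to hold valid strings; surface a failure.
    let s = str::from_utf8(data).map_err(|_| TruncateError::InvalidUtf8)?;
    Ok(match truncate_on_char_boundary(s, l) {
        Some(truncated) => (truncated, true),
        // Assumption: if no boundary fits, keep the full value.
        None => (data.to_vec(), false),
    })
}
```

With this shape, a binary column still takes the cheap byte-slice path, while a column declared as a string would report a problem rather than emit a statistic cut in the middle of a character.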



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
