AdamGS commented on code in PR #4389:
URL: https://github.com/apache/arrow-rs/pull/4389#discussion_r1223893073


##########
parquet/src/column/writer/mod.rs:
##########
@@ -683,6 +683,14 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> 
{
             .append_row_count(self.page_metrics.num_buffered_rows as i64);
     }
 
+    fn truncate_minmax_value(&self, data: &[u8]) -> Vec<u8> {
+        if let Some(max_len) = self.props.minmax_value_truncate_len() {
+            data[0..usize::min(data.len(), max_len)].to_vec()

Review Comment:
   While I couldn't find a more explicit example/phrasing in the spec as to 
should it be "B" or "C", as far as I can tell the Java implementation does 
behave that way (See 
[here](apache/parquet-mr/pull/481/files#diff-de437a0783662ef1b4e4e8c45b5bb438016dbc57bd8582258351c61415b894edR96-R97))
 which I find convincing. Seems to me like that is already a possible issue 
without truncating, I think just +1 the last character won't work here due to 
overflowing, and if we want "printable" strings that also has to extend to 
multi-byte characters, which brings me to the second point.
   
   I'll change that to check if its a valid rust `str`, and if it is the code 
will find the longest valid one under the threshold (as it might fall in a 
middle of a multi-byte string).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to