AdamGS commented on code in PR #4389:
URL: https://github.com/apache/arrow-rs/pull/4389#discussion_r1223893073


##########
parquet/src/column/writer/mod.rs:
##########
@@ -683,6 +683,14 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> 
{
             .append_row_count(self.page_metrics.num_buffered_rows as i64);
     }
 
+    fn truncate_minmax_value(&self, data: &[u8]) -> Vec<u8> {
+        if let Some(max_len) = self.props.minmax_value_truncate_len() {
+            data[0..usize::min(data.len(), max_len)].to_vec()

Review Comment:
   While I couldn't find a more explicit example/phrasing in the spec as to 
should it be "B" or "C", as far as I can tell the Java implementation does 
behave that way (See 
[here](apache/parquet-mr/pull/481/files#diff-de437a0783662ef1b4e4e8c45b5bb438016dbc57bd8582258351c61415b894edR96-R97))
 which I find convincing. Seems to me like that is already a possible issue 
without truncating.
   
   Good point about strings and multi-byte character, I'll change that to check 
if its a valid rust `str`, and if it is the code will find the longest valid 
one under the threshold (as it might fall in a middle of a multi-byte string).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to