AdamGS commented on code in PR #4389:
URL: https://github.com/apache/arrow-rs/pull/4389#discussion_r1223893073
##########
parquet/src/column/writer/mod.rs:
##########
@@ -683,6 +683,14 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E>
{
.append_row_count(self.page_metrics.num_buffered_rows as i64);
}
+ fn truncate_minmax_value(&self, data: &[u8]) -> Vec<u8> {
+ if let Some(max_len) = self.props.minmax_value_truncate_len() {
+ data[0..usize::min(data.len(), max_len)].to_vec()
Review Comment:
While I couldn't find a more explicit example/phrasing in the spec as to
should it be "B" or "C", as far as I can tell the Java implementation does
behave that way (See
[here](apache/parquet-mr/pull/481/files#diff-de437a0783662ef1b4e4e8c45b5bb438016dbc57bd8582258351c61415b894edR96-R97))
which I find convincing. Seems to me like that is already a possible issue
without truncating.
Good point about strings and multi-byte character, I'll change that to check
if its a valid rust `str`, and if it is the code will find the longest valid
one under the threshold (as it might fall in a middle of a multi-byte string).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]