wjones127 commented on code in PR #4389:
URL: https://github.com/apache/arrow-rs/pull/4389#discussion_r1223844119
##########
parquet/src/column/writer/mod.rs:
##########
@@ -2220,6 +2228,66 @@ mod tests {
);
}
+ /// Verify min/max value truncation iin the column index works as expected
Review Comment:
```suggestion
/// Verify min/max value truncation in the column index works as expected
```
##########
parquet/src/column/writer/mod.rs:
##########
@@ -683,6 +683,14 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E>
{
.append_row_count(self.page_metrics.num_buffered_rows as i64);
}
+ fn truncate_minmax_value(&self, data: &[u8]) -> Vec<u8> {
+ if let Some(max_len) = self.props.minmax_value_truncate_len() {
+ data[0..usize::min(data.len(), max_len)].to_vec()
Review Comment:
I'm not sure this is quite what we want. First, I don't think this would
give the behavior described in the example:
> For example, instead of storing ""Blart Versenwald III", a writer may set
min_values[i]="B", max_values[i]="C".
https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach
With this implementation, if we truncated to the first byte we would get min
and max of `"B"`, right?
Second, slicing a string based on bytes means we could be slicing in the
middle of a multi-byte character. That's not strictly an issue for sorting, but
could have some annoying consequences when debugging and inspecting statistics.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]