[GitHub] [arrow-rs] wjones127 commented on a diff in pull request #4389: Truncate Min/Max values in the Colum Index

via GitHub Thu, 08 Jun 2023 22:26:04 -0700


wjones127 commented on code in PR #4389:
URL: https://github.com/apache/arrow-rs/pull/4389#discussion_r1223844119



##########
parquet/src/column/writer/mod.rs:
##########
@@ -2220,6 +2228,66 @@ mod tests {
         );
     }
 
+    /// Verify min/max value truncation iin the column index works as expected

Review Comment:
   ```suggestion
       /// Verify min/max value truncation in the column index works as expected
   ```



##########
parquet/src/column/writer/mod.rs:
##########
@@ -683,6 +683,14 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> 
{
             .append_row_count(self.page_metrics.num_buffered_rows as i64);
     }
 
+    fn truncate_minmax_value(&self, data: &[u8]) -> Vec<u8> {
+        if let Some(max_len) = self.props.minmax_value_truncate_len() {
+            data[0..usize::min(data.len(), max_len)].to_vec()

Review Comment:
   I'm not sure this is quite what we want. First, I don't think this would 
give the behavior described in the example:
   
   > For example, instead of storing ""Blart Versenwald III", a writer may set 
min_values[i]="B", max_values[i]="C". 
   
   
https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach
   
   With this implementation, if we truncated to the first byte we would get min 
and max of `"B"`, right?
   
   Second, slicing a string based on bytes means we could be slicing in the 
middle of a multi-byte character. That's not strictly an issue for sorting, but 
could have some annoying consequences when debugging and inspecting statistics.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] wjones127 commented on a diff in pull request #4389: Truncate Min/Max values in the Colum Index

Reply via email to