emcake opened a new issue, #5037:
URL: https://github.com/apache/arrow-rs/issues/5037

   **Describe the bug**
   #4389 introduced truncation of the min/max values in the column index for 
binary columns, where those values may be arbitrarily large. As noted there, 
this matches the behaviour in parquet-mr for shortening column index values.
   
   However, the min/max values in the statistics are still written untruncated. 
This differs from the behaviour of parquet-mr, where the statistics are 
truncated too: 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L715
   
   **To Reproduce**
   There is a test in https://github.com/delta-io/delta-rs/issues/1805 which 
demonstrates this. More generally, write a Parquet file with a long binary 
column and observe that the stats for that column are not truncated.
   
   **Expected behavior**
   Matching parquet-mr, the statistics should be truncated as well.
   
   **Additional context**
   Found this when looking into 
https://github.com/delta-io/delta-rs/issues/1805. delta-rs serializes the 
column stats into the delta log, so untruncated values lead to very bloated 
log entries.
   
   I think it is sufficient to just call `truncate_min_value`/`truncate_max_value` 
when creating the column metadata here: 
https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L858-L859 
but I don't know enough about the internals of arrow to know if that change is 
correct.
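
   For illustration, here is a minimal sketch of what truncating a min/max pair 
means at the byte level. The names `truncate_min` and `truncate_max` are 
hypothetical and are not the arrow-rs functions mentioned above. The sketch 
assumes plain byte-wise ordering and ignores UTF-8 validity, which a real 
implementation would need to respect for string columns.

```rust
/// Hypothetical helper: shorten a lower bound to at most `len` bytes.
/// A prefix of a byte string never sorts after the original, so the
/// prefix is still a valid (if looser) minimum.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Hypothetical helper: shorten an upper bound to at most `len` bytes.
/// A bare prefix would sort *before* the original, so the prefix is
/// incremented to keep it a valid maximum. Returns `None` when no
/// shorter upper bound exists (every kept byte is 0xFF, or `len` is 0).
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        // Already short enough: nothing to do.
        return Some(value.to_vec());
    }
    let mut out = value[..len].to_vec();
    // Drop trailing 0xFF bytes (they cannot be incremented), then
    // increment the last remaining byte.
    while let Some(&last) = out.last() {
        if last == 0xFF {
            out.pop();
        } else {
            *out.last_mut().unwrap() += 1;
            return Some(out);
        }
    }
    None
}
```

   With these definitions, truncating `b"abcdef"` to 3 bytes gives `b"abc"` as 
the min and `b"abd"` as the max. When `truncate_max` returns `None`, a writer 
would have to fall back to the untruncated value or omit the bound.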


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
