emcake opened a new issue, #5037: URL: https://github.com/apache/arrow-rs/issues/5037
**Describe the bug** #4389 introduced truncation on column indices for binary columns, where the min/max values for a binary column may be arbitrarily large. As noted, this matches the behaviour in parquet-mr for shortening columns. However, the value in the statistics is written un-truncated. This differs from the behaviour of parquet-mr where the statistics are truncated too: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L715 **To Reproduce** There is a test in https://github.com/delta-io/delta-rs/issues/1805 which demonstrates this, but in general write a parquet file with a long binary column and observe that the stats for that column are not truncated. **Expected behavior** Matching parquet-mr, the statistics should be truncated as well. **Additional context** Found this when looking into https://github.com/delta-io/delta-rs/issues/1805. delta-rs uses the column stats to serialize into the delta log, which leads to very bloated entries. I think it is sufficient to just call truncate_min_value/truncate_max_value when creating the column metadata here: https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L858-L859 but I don't know enough about the internals of arrow to know if that change is correct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
