[I] Binary columns do not receive truncated statistics [arrow-rs]

via GitHub Sat, 04 Nov 2023 15:51:49 -0700


emcake opened a new issue, #5037:
URL: https://github.com/apache/arrow-rs/issues/5037

**Describe the bug**
#4389 introduced truncation of the min/max values written to the column index
for binary columns, since those values may be arbitrarily large. As noted
there, this matches the behaviour of parquet-mr when shortening such values.

However, the value in the statistics is written un-truncated. This differs
from the behaviour of parquet-mr where the statistics are truncated too:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L715

**To Reproduce**
There is a test in https://github.com/delta-io/delta-rs/issues/1805 which
demonstrates this. More generally, write a parquet file with a binary column
that holds long values and observe that the statistics for that column are not
truncated.

**Expected behavior**
Matching parquet-mr, the statistics should be truncated as well.

**Additional context**
Found this when looking into
https://github.com/delta-io/delta-rs/issues/1805. delta-rs serializes the
column statistics into the delta log, so untruncated values lead to very
bloated log entries.

I think it is sufficient to just call truncate_min_value/truncate_max_value
when creating the column metadata here:
https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L858-L859
but I don't know enough about the internals of arrow to know if that change is
correct.
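
To illustrate what truncating a min/max pair involves, here is a minimal,
self-contained sketch. The names `truncate_min` and `truncate_max` are
hypothetical stand-ins and are not the arrow-rs functions mentioned above; the
sketch treats values as raw bytes and does not try to keep the result valid
UTF-8. The point is that a minimum can simply be cut to a prefix, while a
maximum has to be rounded up so that it still bounds the original value.

```rust
/// Hypothetical helper: shorten a minimum bound to at most `len` bytes.
/// A prefix of a byte string never sorts after the original, so a plain
/// prefix is still a valid lower bound.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Hypothetical helper: shorten a maximum bound to at most `len` bytes.
/// A plain prefix would sort before the original, so the prefix is rounded
/// up: trailing 0xFF bytes are dropped and the last remaining byte is
/// incremented. Returns `None` when no shorter upper bound exists (the
/// prefix is empty or consists only of 0xFF bytes).
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        // Already short enough: keep the exact value.
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    while let Some(last) = prefix.pop() {
        if last != 0xFF {
            // Incrementing this byte makes the prefix sort after `value`.
            prefix.push(last + 1);
            return Some(prefix);
        }
        // A 0xFF byte cannot be incremented; drop it and try the previous one.
    }
    None
}
```

With a limit of 3 bytes, `truncate_min(b"abcdefgh", 3)` yields `abc`, and
`truncate_max(b"zzzzzz", 3)` yields `zz{`, which still sorts after the
original value. When `truncate_max` returns `None`, a caller would presumably
have to fall back to the untruncated value.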
