alamb opened a new issue, #7490: URL: https://github.com/apache/arrow-rs/issues/7490
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

By default, the arrow-rs parquet writer saves the entire actual min and max values for any column that has statistics enabled into the page metadata.

For large binary/string columns (think JSON blobs), this means that two potentially large values (a min and a max) will be stored in both the file-level metadata and in each page header.

This can lead to pathological cases such as the one described in
- https://github.com/apache/arrow-rs/issues/7489

It is possible to control the maximum size of the values using [`WriterPropertiesBuilder::set_statistics_truncate_length`](https://arrow.apache.org/rust/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_statistics_truncate_length); however, this value currently defaults to `None` (unlimited).

I also think it is unlikely that the full min/max values for large string columns will provide significantly better pruning than truncated ones.

**Describe the solution you'd like**

I propose we set the default statistics truncate length to a non-`None` value to avoid pathological cases.

**Describe alternatives you've considered**

I would propose picking a value like `128` that is long enough to capture all primitive data types and "short" strings.

We can (and should) also document the default better.

**Additional context**
- related to https://github.com/apache/arrow-rs/issues/7489
- https://github.com/apache/arrow/issues/46404
- https://github.com/kylebarron/arro3/issues/324
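
To illustrate why truncated statistics can still be used for pruning, here is a minimal sketch of one way to shorten min/max byte strings while keeping them valid bounds. It is not the parquet crate's implementation: the helper names `truncate_min` and `truncate_max` are hypothetical, and it assumes values are compared as unsigned bytes in lexicographic order (it does not try to keep UTF-8 strings valid).

```rust
/// Hypothetical helper: shorten a minimum value to at most `len` bytes.
/// A prefix of a byte string never sorts after the full string, so the
/// prefix is still a valid (if looser) lower bound.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Hypothetical helper: shorten a maximum value to at most `len` bytes.
/// The plain prefix may sort *before* the full value, so the last byte that
/// is not 0xFF is incremented to get a bound that sorts after it.
/// Returns `None` when no shorter upper bound exists (e.g. all bytes 0xFF).
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        // Already short enough: keep the exact maximum.
        return Some(value.to_vec());
    }
    let mut out = value[..len].to_vec();
    // Drop trailing 0xFF bytes (they cannot be incremented), then bump
    // the first byte that can be.
    while let Some(last) = out.pop() {
        if last < 0xFF {
            out.push(last + 1);
            return Some(out);
        }
    }
    None
}
```

With a limit of 5 bytes, `truncate_min(b"apple pie", 5)` yields `b"apple"` and `truncate_max(b"apple pie", 5)` yields `b"applf"`. Both still bracket the original value, so a reader can prune on them safely; the bounds are just less tight.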
