alamb opened a new issue, #7580: URL: https://github.com/apache/arrow-rs/issues/7580
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** There are currently **3** places statistics can be written in Parquet files: 1. In the metadata for each ColumnChunk ([source](https://github.com/apache/parquet-format/blob/f1fd3b9171aec7a7f0106e0203caef88d17dda82/src/main/thrift/parquet.thrift#L911-L912)) 2. in the data page header ([source](https://github.com/apache/parquet-format/blob/f1fd3b9171aec7a7f0106e0203caef88d17dda82/src/main/thrift/parquet.thrift#L748-L749)) 3. In the "ColumnIndex" ([source](https://github.com/apache/parquet-format/blob/f1fd3b9171aec7a7f0106e0203caef88d17dda82/src/main/thrift/parquet.thrift#L1174-L1175)) which is part of the so called ["Page Index"](https://github.com/apache/parquet-format/blob/master/PageIndex.md) The level of statistics is controlled by the [EnabledStatistics](https://docs.rs/parquet/latest/parquet/file/properties/enum.EnabledStatistics.html) structure: * `EnabledStatistics::None`: No statistics * `EnabledStatistics::Chunk`: Stores the statistics for each ColumnChunk (1 above) * `EnabledStatistics::Page`: Stores the statistics for each ColumnChunk, **AND** the data page headers **AND** the ColumnIndex ## Problem `EnabledStatistics::Page` is wasteful because: 1. It stores the page level statistics twice resulting in larger parquet files (see https://github.com/apache/arrow-rs/issues/7579 for an example) 2. The copy in the data page header can not even be accessed via the Rust parquet reader and I don't think it is widely used (it was effectively replaced by the PageIndex) In fact, as @etseidl points out here: https://github.com/apache/arrow-rs/issues/7490#issuecomment-2923682915 > Makes me wonder if we should rethink EnabledStatistics. The Parquet spec actually recommends not writing page level statistics if the page indexes are written. Perhaps we could add something like EnabledStatistics::ChunkAndIndex to write chunk level and offset/column indexes but no statistics in the page header. Specifically the documentation on [PageIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md) says: > Readers that support ColumnIndex should not also use page statistics. The only reason to write page-level statistics when writing ColumnIndex structs is to support older readers (not recommended). **Describe the solution you'd like** I would like a way to avoid writing data page header statistics (as they are likely to not be useful to other systems and thus wasteful) **Describe alternatives you've considered** ### Option 1: REdefine `EnabledStatistics::Page` , I personally suggest the following change which would requires no changes for users who have set `EnabledStatistics::Page` and make their parquet files smaller. 1. Redefine `EnabledStatistics::Page`: to mean store statistics for ColumnChunk and PageIndex (not data page headers) 2. Add a new option `WriterProperties::write_data_page_statistics` that would explicitly also write the data page headers as well. We would add a note saying the option is not recommended for the reasons listed above ### Option 2: `EnabledStatistics::ChunkAndIndex` @etseidl suggests adding another variant: > Perhaps we could add something like EnabledStatistics::ChunkAndIndex to write chunk level and offset/column indexes but no statistics in the page header. One challenge with this is that it would require all existing users to know to update their code to stop writing data page headers ### Option 3: `EnabledStatistics` more specific Another alternative is to make `EnabledStatistics` more specific, something like ```rust * `EnabledStatistics::None`: No statistics * `EnabledStatistics::Chunk`: Stores the statistics for each ColumnChunk (1 above) * `EnabledStatistics::ColumnIndex`: Stores the statistics in the ColumnChunk and ColumnIndex * `EnabledStatistics::ColumnIndexAndPage`: Stores the statistics in the data page headers **AND** the ColumnChunk and the ColumnIndex ``` This would be a breaking API change that would be somewhat annoying to downstream users as they would have to change their code. **Additional context** - https://github.com/apache/arrow-rs/issues/7490 - https://github.com/apache/arrow-rs/issues/7579 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org