[I] Some way to Avoid writing redundant statistics into data page headers [arrow-rs]

via GitHub Sat, 31 May 2025 13:12:48 -0700


alamb opened a new issue, #7580:
URL: https://github.com/apache/arrow-rs/issues/7580


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   There are currently **3** places statistics can be written in Parquet files:
   1. In the metadata for each ColumnChunk 
([source](https://github.com/apache/parquet-format/blob/f1fd3b9171aec7a7f0106e0203caef88d17dda82/src/main/thrift/parquet.thrift#L911-L912))
   2. In the data page header 
([source](https://github.com/apache/parquet-format/blob/f1fd3b9171aec7a7f0106e0203caef88d17dda82/src/main/thrift/parquet.thrift#L748-L749))
   3. In the "ColumnIndex" 
([source](https://github.com/apache/parquet-format/blob/f1fd3b9171aec7a7f0106e0203caef88d17dda82/src/main/thrift/parquet.thrift#L1174-L1175))
 which is part of the so-called ["Page 
Index"](https://github.com/apache/parquet-format/blob/master/PageIndex.md)
   
   The level of statistics is controlled by the 
[EnabledStatistics](https://docs.rs/parquet/latest/parquet/file/properties/enum.EnabledStatistics.html)
 structure:
   
   * `EnabledStatistics::None`: No statistics
   * `EnabledStatistics::Chunk`: Stores the statistics for each ColumnChunk (1 
above)
   * `EnabledStatistics::Page`: Stores the statistics for each ColumnChunk, 
**AND** the data page headers **AND** the ColumnIndex
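
To make the mapping above concrete, here is a minimal, self-contained sketch of which locations each level writes today. `StatsLevel`, `Location`, and `locations_written` are hypothetical stand-ins for illustration only; they are not types or functions from the parquet crate.

```rust
// Simplified model of the behavior described above; not the parquet crate's API.

/// Hypothetical mirror of the three `EnabledStatistics` levels.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StatsLevel {
    None,
    Chunk,
    Page,
}

/// The three places statistics can be written in a Parquet file.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Location {
    ColumnChunkMetadata,
    DataPageHeader,
    ColumnIndex,
}

/// Locations written at each level, following the list above.
fn locations_written(level: StatsLevel) -> Vec<Location> {
    match level {
        StatsLevel::None => vec![],
        StatsLevel::Chunk => vec![Location::ColumnChunkMetadata],
        StatsLevel::Page => vec![
            Location::ColumnChunkMetadata,
            Location::DataPageHeader,
            Location::ColumnIndex,
        ],
    }
}

fn main() {
    for level in [StatsLevel::None, StatsLevel::Chunk, StatsLevel::Page] {
        println!("{:?} -> {:?}", level, locations_written(level));
    }
}
```

In this model, `Page` is the only level that touches the data page header, and it also writes the same page-level values into the ColumnIndex.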
   
   ## Problem
   `EnabledStatistics::Page` is wasteful because:
   1. It stores the page level statistics twice resulting in larger parquet 
files (see https://github.com/apache/arrow-rs/issues/7579 for an example)
   2. The copy in the data page header cannot even be accessed via the Rust 
parquet reader and I don't think it is widely used (it was effectively replaced 
by the PageIndex)
   
   In fact, as @etseidl points out here: 
https://github.com/apache/arrow-rs/issues/7490#issuecomment-2923682915
   
   > Makes me wonder if we should rethink EnabledStatistics. The Parquet spec 
actually recommends not writing page level statistics if the page indexes are 
written. Perhaps we could add something like EnabledStatistics::ChunkAndIndex 
to write chunk level and offset/column indexes but no statistics in the page 
header.
   
   Specifically the documentation on 
[PageIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md) 
says:
   
   > Readers that support ColumnIndex should not also use page statistics. The 
only reason to write page-level statistics when writing ColumnIndex structs is 
to support older readers (not recommended).
   
   
   **Describe the solution you'd like**
   I would like a way to avoid writing data page header statistics (as they are 
unlikely to be useful to other systems and are thus wasteful)
   
   **Describe alternatives you've considered**
   
   ### Option 1: Redefine `EnabledStatistics::Page`
   I personally suggest the following change, which would require no changes 
for users who have set `EnabledStatistics::Page` and would make their parquet 
files smaller.
   
   1. Redefine `EnabledStatistics::Page` to mean: store statistics for the 
ColumnChunk and PageIndex (not the data page headers)
   2. Add a new option `WriterProperties::write_data_page_statistics` that 
would explicitly also write statistics to the data page headers. We would add a 
note saying the option is not recommended for the reasons listed above
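
A rough sketch of how Option 1 might behave, as a self-contained model. Everything here is hypothetical: `WriterOptions`, `StatsPlan`, and `plan` are illustrative names, and the `write_data_page_statistics` field only stands in for the proposed option. I am also assuming the flag has an effect only when the level is `Page`, which the proposal above does not spell out.

```rust
// Hypothetical model of Option 1; none of these names exist in the parquet crate.

#[derive(Debug, Clone, Copy, PartialEq)]
enum StatsLevel {
    None,
    Chunk,
    Page,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct WriterOptions {
    statistics: StatsLevel,
    /// Stand-in for the proposed `write_data_page_statistics` option.
    write_data_page_statistics: bool,
}

/// Which of the three locations would receive statistics.
#[derive(Debug, PartialEq)]
struct StatsPlan {
    column_chunk: bool,
    column_index: bool,
    data_page_header: bool,
}

fn plan(opts: WriterOptions) -> StatsPlan {
    let page = opts.statistics == StatsLevel::Page;
    StatsPlan {
        column_chunk: opts.statistics != StatsLevel::None,
        column_index: page,
        // Assumption: header statistics are written only on explicit opt-in,
        // and only when page-level statistics are enabled at all.
        data_page_header: page && opts.write_data_page_statistics,
    }
}

fn main() {
    // Existing users of `Page` keep their code and get smaller files.
    let unchanged_user = WriterOptions {
        statistics: StatsLevel::Page,
        write_data_page_statistics: false,
    };
    println!("{:?}", plan(unchanged_user));
}
```

The point of this shape is that code which already sets `Page` compiles unchanged and stops writing the redundant copy, while anyone who needs the header copy for older readers has to ask for it explicitly.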
   
   ### Option 2: `EnabledStatistics::ChunkAndIndex`
   
   @etseidl suggests adding another variant:
   
   > Perhaps we could add something like EnabledStatistics::ChunkAndIndex to 
write chunk level and offset/column indexes but no statistics in the page 
header.
   
   One challenge with this is that it would require all existing users to know 
to update their code to stop writing data page header statistics
   
   ### Option 3: Make `EnabledStatistics` more specific
   
   Another alternative is to make `EnabledStatistics` more specific, something 
like
   
   ```rust
   pub enum EnabledStatistics {
       /// No statistics
       None,
       /// Stores the statistics for each ColumnChunk (1 above)
       Chunk,
       /// Stores the statistics in the ColumnChunk and ColumnIndex
       ColumnIndex,
       /// Stores the statistics in the data page headers **AND** the
       /// ColumnChunk and the ColumnIndex
       ColumnIndexAndPage,
   }
   ```
   
   This would be a breaking API change that would be somewhat annoying to 
downstream users as they would have to change their code. 
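
To illustrate the migration cost, here is a small self-contained sketch of how downstream code would have to translate today's variants into the ones proposed in Option 3. `OldStatistics`, `NewStatistics`, and `migrate` are hypothetical names used only for this example.

```rust
// Hypothetical migration helper; these are not parquet crate types.

/// The variants that exist today.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OldStatistics {
    None,
    Chunk,
    Page,
}

/// The more specific variants proposed in Option 3.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NewStatistics {
    None,
    Chunk,
    ColumnIndex,
    ColumnIndexAndPage,
}

/// Maps an old setting to a new one. `keep_page_header_stats` is the decision
/// every current `Page` user would have to make during the upgrade.
fn migrate(old: OldStatistics, keep_page_header_stats: bool) -> NewStatistics {
    match (old, keep_page_header_stats) {
        (OldStatistics::None, _) => NewStatistics::None,
        (OldStatistics::Chunk, _) => NewStatistics::Chunk,
        // Preserves today's output byte for byte, redundant copy included.
        (OldStatistics::Page, true) => NewStatistics::ColumnIndexAndPage,
        // Drops the data page header copy and yields smaller files.
        (OldStatistics::Page, false) => NewStatistics::ColumnIndex,
    }
}

fn main() {
    println!("{:?}", migrate(OldStatistics::Page, false));
}
```

Because `Page` no longer exists in this scheme, every caller that used it fails to compile until it picks one of the two replacements, which is the breaking change described above.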
   
   
   
   **Additional context**
   - https://github.com/apache/arrow-rs/issues/7490
   - https://github.com/apache/arrow-rs/issues/7579


