alamb opened a new issue, #10806: URL: https://github.com/apache/datafusion/issues/10806
### Is your feature request related to a problem or challenge?

Related to https://github.com/apache/datafusion/issues/10453

There are at least two types of statistics stored in Parquet files:

1. `ColumnChunk` level statistics (a min/max/null_count per column per row group): [`RowGroupMetaData`](https://docs.rs/parquet/latest/parquet/file/metadata/struct.RowGroupMetaData.html) --> [ColumnChunkMetaData](https://docs.rs/parquet/latest/parquet/file/metadata/struct.ColumnChunkMetaData.html) --> [Option](https://doc.rust-lang.org/nightly/core/option/enum.Option.html)<&[Statistics](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html)>
2. "Page Index" statistics (stored per page; there may be more than one page per column per row group): [ColumnChunkMetaData](https://docs.rs/parquet/latest/parquet/file/metadata/struct.ColumnChunkMetaData.html) --> [read_columns_indexes](https://docs.rs/parquet/latest/parquet/file/page_index/index_reader/fn.read_columns_indexes.html#) --> [Vec](https://doc.rust-lang.org/nightly/alloc/vec/struct.Vec.html)<[Index](https://docs.rs/parquet/latest/parquet/file/page_index/index/enum.Index.html)>

As part of https://github.com/apache/datafusion/issues/10453 we have pulled the conversion of the `ColumnChunk` level statistics into `StatisticsConverter`, and https://github.com/apache/datafusion/pull/10802 prunes the row groups using this API.

It would be good to apply the same treatment to the statistics in the page index.

### Describe the solution you'd like

1. Add a clear API to efficiently extract page statistics outside of DataFusion
2. Ensure that API is well tested
3. Ensure the API is fast

### Describe alternatives you've considered

1. Move / refactor the code that extracts `ArrayRef` from `Index` in page_filter ([source link](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L394-L567)) into `StatisticsConverter` ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L363))
2. Update the tests in arrow_statistics ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/tests/parquet/arrow_statistics.rs#L180-L237)) to also verify that the page statistics are correct (I believe the page min/maxes should be the same as the row group min/maxes)
3. Update the parquet code `prune_pages_in_one_row_group` ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L301-L365)) to use the new `StatisticsConverter` code
4. Update the benchmark ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L152)) to cover extracting page statistics, and use it to ensure the statistics extraction code is reasonably performant

### Additional context

_No response_
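
To illustrate how the two statistics levels described above relate, here is a minimal, standard-library-only sketch. It is not the `parquet` crate's `Index` type or DataFusion's `StatisticsConverter`: `PageStats`, `row_group_stats`, and `pages_to_read_gt` are hypothetical names made up for this example, and it assumes a single `i64` column whose page statistics are always present. It shows that row-group min/max/null_count can be derived by folding the per-page values, and how per-page max values let a `col > threshold` predicate skip pages.

```rust
/// Hypothetical per-page statistics for a single i64 column.
/// Illustrative stand-in only, not the `parquet` crate's `Index` type.
#[derive(Debug, Clone, PartialEq)]
pub struct PageStats {
    /// Minimum non-null value in the page; `None` if the page is all nulls.
    pub min: Option<i64>,
    /// Maximum non-null value in the page; `None` if the page is all nulls.
    pub max: Option<i64>,
    /// Number of null values in the page.
    pub null_count: u64,
}

/// Fold page-level statistics into row-group-level (min, max, null_count).
pub fn row_group_stats(pages: &[PageStats]) -> (Option<i64>, Option<i64>, u64) {
    // All-null pages contribute nothing to min/max, so skip their `None`s.
    let min = pages.iter().filter_map(|p| p.min).min();
    let max = pages.iter().filter_map(|p| p.max).max();
    let null_count = pages.iter().map(|p| p.null_count).sum::<u64>();
    (min, max, null_count)
}

/// For the predicate `col > threshold`, return one flag per page:
/// `true` means the page may contain matches and must be read,
/// `false` means its statistics prove that no row can match.
pub fn pages_to_read_gt(pages: &[PageStats], threshold: i64) -> Vec<bool> {
    pages
        .iter()
        .map(|p| match p.max {
            Some(max) => max > threshold,
            // All-null page: `col > threshold` is never true for NULL.
            None => false,
        })
        .collect()
}
```

When a row group holds exactly one page per column, `row_group_stats` returns that page's own values, which matches the expectation in item 2 that the page min/maxes equal the row group min/maxes in the existing tests.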