[I] Efficiently and correctly Extract Page Index statistics into `ArrayRef`s [datafusion]


alamb opened a new issue, #10806:
URL: https://github.com/apache/datafusion/issues/10806


   ### Is your feature request related to a problem or challenge?
   
   Related to https://github.com/apache/datafusion/issues/10453
   
   There are at least two types of statistics stored in Parquet files:
   
   1. `ColumnChunk` level statistics (a min/max/null_count per column per row group): [`RowGroupMetaData`](https://docs.rs/parquet/latest/parquet/file/metadata/struct.RowGroupMetaData.html) --> [ColumnChunkMetaData](https://docs.rs/parquet/latest/parquet/file/metadata/struct.ColumnChunkMetaData.html) --> [Option](https://doc.rust-lang.org/nightly/core/option/enum.Option.html)<&[Statistics](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html)>
   2. "Page Index" statistics (stored per page; there may be more than one page per column per row group): [ColumnChunkMetaData](https://docs.rs/parquet/latest/parquet/file/metadata/struct.ColumnChunkMetaData.html) --> [read_columns_indexes](https://docs.rs/parquet/latest/parquet/file/page_index/index_reader/fn.read_columns_indexes.html#) --> [Vec](https://doc.rust-lang.org/nightly/alloc/vec/struct.Vec.html)<[Index](https://docs.rs/parquet/latest/parquet/file/page_index/index/enum.Index.html)>
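
To make the difference in granularity concrete, here is a rough sketch of the two levels. The types `MinMax`, `ColumnChunk`, `RowGroup` and the function `stat_slots` are hypothetical, simplified stand-ins for illustration only; they are not the parquet crate's real `Statistics` or `Index` types.

```rust
// Hypothetical, simplified model of the two statistics levels.
// These are NOT the parquet crate's types.
#[derive(Debug, Clone, PartialEq)]
struct MinMax {
    min: i64,
    max: i64,
    null_count: u64,
}

struct ColumnChunk {
    // Level 1: at most one statistics entry for the whole column chunk.
    chunk_stats: Option<MinMax>,
    // Level 2: one slot per data page; `None` models a page without statistics.
    page_stats: Vec<Option<MinMax>>,
}

struct RowGroup {
    columns: Vec<ColumnChunk>,
}

/// Returns how many statistics slots exist for column `col` at each level:
/// (one per row group, one per page across all row groups).
fn stat_slots(row_groups: &[RowGroup], col: usize) -> (usize, usize) {
    let chunk_level = row_groups.len();
    let page_level: usize = row_groups
        .iter()
        .map(|rg| rg.columns[col].page_stats.len())
        .sum();
    (chunk_level, page_level)
}
```

In this model, an array built from page-level statistics has one element per page, so it is generally longer than the corresponding row-group-level array.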
   
   As part of https://github.com/apache/datafusion/issues/10453 we have pulled conversion of the `ColumnChunk` level statistics into `StatisticsConverter`, and https://github.com/apache/datafusion/pull/10802 prunes the row groups using this API.
   
   It would be good to apply the same treatment to the statistics in the page index.
   
   ### Describe the solution you'd like
   
   1. Add a clear API to efficiently extract page statistics outside of DataFusion
   2. Ensure that API is well tested
   3. Ensure the API is fast
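
As a rough illustration of the shape such an API could take, the sketch below flattens per-page statistics into one columnar result per statistic. Everything here is an assumption for illustration: `PageStats`, `PageStatArrays`, and `extract_page_stats` are hypothetical names, and `Vec<Option<_>>` stands in for a nullable arrow `ArrayRef`. The real implementation would read from the parquet `Index` and would need to handle every physical type.

```rust
/// Hypothetical stand-in for one page's entry in the page index.
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: Option<i32>,
    max: Option<i32>,
    null_count: Option<u64>,
}

/// Columnar result: one slot per page, `None` where a statistic is missing.
/// `Vec<Option<_>>` is a stand-in for a nullable arrow array.
#[derive(Debug, PartialEq)]
struct PageStatArrays {
    mins: Vec<Option<i32>>,
    maxes: Vec<Option<i32>>,
    null_counts: Vec<Option<u64>>,
}

/// Flattens the pages of the selected row groups, in order, into arrays.
fn extract_page_stats<'a>(
    row_groups: impl IntoIterator<Item = &'a [PageStats]>,
) -> PageStatArrays {
    let mut out = PageStatArrays {
        mins: Vec::new(),
        maxes: Vec::new(),
        null_counts: Vec::new(),
    };
    for pages in row_groups {
        // Missing statistics are preserved as nulls rather than dropped,
        // so every output array keeps exactly one slot per page.
        out.mins.extend(pages.iter().map(|p| p.min));
        out.maxes.extend(pages.iter().map(|p| p.max));
        out.null_counts.extend(pages.iter().map(|p| p.null_count));
    }
    out
}
```

Keeping missing statistics as nulls means the position of each value still identifies its page, which is what a pruning step needs.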
   
   ### Describe alternatives you've considered
   
   1. Move / refactor the code to extract `ArrayRef` from `Index` in `page_filter` ([source link](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L394-L567)) to `StatisticsConverter` ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L363))
   2. Update the tests in `arrow_statistics` ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/tests/parquet/arrow_statistics.rs#L180-L237)) to also verify that the page statistics are correct (I believe the page min/maxes should be the same as the row group min/maxes)
   3. Update the parquet code `prune_pages_in_one_row_group` ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L301-L365)) to use the new `StatisticsExtractor` code
   4. Update the benchmark ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L152)) for extracting page statistics and use that to ensure the statistics extraction code is reasonably performant
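
The test and pruning steps above can be sketched as follows, under stated assumptions. `fold_page_bounds` and `pages_to_keep_gt` are hypothetical helpers, not existing DataFusion functions, and plain `i32` values stand in for arrow arrays. The first helper folds page bounds up to chunk level so a test can compare them with the row group min/max; when a chunk has a single page, the two should match directly. The second shows the kind of decision page pruning makes for a predicate of the form `col > threshold`.

```rust
/// Folds per-page (min, max) pairs into chunk-level bounds.
/// Pages without statistics (`None`) are skipped; returns `None` if no
/// page has statistics.
fn fold_page_bounds(pages: &[Option<(i32, i32)>]) -> Option<(i32, i32)> {
    pages
        .iter()
        .flatten()
        .copied()
        .reduce(|(lo, hi), (min, max)| (lo.min(min), hi.max(max)))
}

/// For a predicate `col > threshold`, returns one flag per page:
/// `true` if the page may contain matching rows and must be read.
/// A page with no max statistic cannot be ruled out, so it is kept.
fn pages_to_keep_gt(page_maxes: &[Option<i32>], threshold: i32) -> Vec<bool> {
    page_maxes
        .iter()
        .map(|max| max.map_or(true, |m| m > threshold))
        .collect()
}
```

Treating a missing statistic as "keep the page" is the conservative choice: pruning may only skip pages that provably contain no matching rows.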
   
   ### Additional context
   
   _No response_


