alamb commented on issue #10806:
URL: https://github.com/apache/datafusion/issues/10806#issuecomment-2156787656

   Thank you @marvinlanhenke -- excellent analysis. 
   
   > * a lot of work the StatisticConverter does is already done 
[here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L155-L187)
   
   Yes. My eventual goal is for all of the code that converts `Index` to 
`ArrayRef` in `page_filter.rs` to be gone, so that `page_filter.rs` only calls 
`StatisticsConverter`. 
   
   To avoid a massive PR, however, I think it makes sense to add new code to 
`StatisticsConverter` for extracting page values and then, once it is complete 
enough, switch `page_filter.rs` to use `StatisticsConverter`.
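   
   As a rough illustration of what "extracting page values" could mean, here 
is a minimal sketch using only the standard library. `PageStats`, 
`ColumnIndex`, `data_page_mins`, and `data_page_maxes` are hypothetical 
stand-ins, not the real parquet `Index` type or the actual 
`StatisticsConverter` API, and a `Vec<Option<i32>>` stands in for an 
`ArrayRef`.

```rust
/// Hypothetical stand-in for the min/max statistics of one data page.
#[derive(Debug, Clone, PartialEq)]
struct PageStats {
    min: Option<i32>,
    max: Option<i32>,
}

/// Hypothetical stand-in for a typed column index:
/// one `PageStats` entry per data page of a column chunk.
#[derive(Debug, Clone, PartialEq)]
enum ColumnIndex {
    /// No page-level statistics were written for this column.
    None,
    /// Page-level statistics for a 32-bit integer column.
    Int32(Vec<PageStats>),
}

/// Collect the per-page minimums into one nullable vector
/// (assumption: the real code would build an Arrow array instead).
/// A missing index yields one null per page, so callers can treat
/// "unknown" uniformly.
fn data_page_mins(index: &ColumnIndex, num_pages: usize) -> Vec<Option<i32>> {
    match index {
        ColumnIndex::None => vec![None; num_pages],
        ColumnIndex::Int32(pages) => pages.iter().map(|p| p.min).collect(),
    }
}

/// Same as `data_page_mins`, but for the per-page maximums.
fn data_page_maxes(index: &ColumnIndex, num_pages: usize) -> Vec<Option<i32>> {
    match index {
        ColumnIndex::None => vec![None; num_pages],
        ColumnIndex::Int32(pages) => pages.iter().map(|p| p.max).collect(),
    }
}
```

   The point of the sketch is only the shape: the match on the index variant 
lives in one place, and the caller receives a plain array of nullable values.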
   
   > * we already iterate over each row_group individually, extracting a single 
Option<&Index> 
[here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L169-L196)
 and passing it into `prune_pages_per_one_row_group`
   
   Indeed, that is how it works today (one row group at a time). I eventually 
hope/plan to apply the same treatment to data page filtering as I did to row 
group filtering in https://github.com/apache/datafusion/pull/10802 (that is, 
make a single call to `PruningPredicate::prune` for all the remaining row 
groups). 
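   
   To make the "single call" idea concrete, here is a minimal sketch under 
heavy simplification. It does not use `PruningPredicate`; the `PageRange` 
alias, the `prune_pages` function, and the hard-coded `col > threshold` 
predicate are all hypothetical. It only shows the flatten, evaluate once, 
split back pattern.

```rust
/// Hypothetical, simplified statistics: (min, max) of one data page.
type PageRange = (i32, i32);

/// Evaluate the predicate `col > threshold` once over the pages of all
/// row groups, then hand back one keep/skip flag per page, grouped by
/// row group. `true` means the page may contain matching rows.
fn prune_pages(row_groups: &[Vec<PageRange>], threshold: i32) -> Vec<Vec<bool>> {
    // 1. Flatten the page maximums of every row group into one container.
    let maxes: Vec<i32> = row_groups
        .iter()
        .flatten()
        .map(|&(_, max)| max)
        .collect();

    // 2. One predicate evaluation over all pages: a page can only hold
    //    rows with `col > threshold` if its maximum exceeds the threshold.
    let keep: Vec<bool> = maxes.iter().map(|&max| max > threshold).collect();

    // 3. Split the flat result back into per-row-group slices.
    let mut out = Vec::with_capacity(row_groups.len());
    let mut offset = 0;
    for rg in row_groups {
        out.push(keep[offset..offset + rg.len()].to_vec());
        offset += rg.len();
    }
    out
}
```

   For example, with row groups `[(0, 5), (6, 10)]` and `[(11, 20)]` and a 
threshold of 8, only the first page of the first row group is skipped.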
   
   > Now, my API has to change. I'm wondering how specific it should be? If we 
pass `&Index` as a parameter, I can match the index and extract the statistic 
as done 
[here](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L394-L567).
 However, I'm not sure this is the way to go. We'd simply move the 
`get_min_max_values_for_page_index` method, and basically have no need for the 
StatisticConverter?
   
   Let me play around with some options and get back to you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


