alamb commented on issue #10806: URL: https://github.com/apache/datafusion/issues/10806#issuecomment-2156787656
Thank you @marvinlanhenke -- excellent analysis.

> * a lot of work the StatisticConverter does is already done [here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L155-L187)

Yes. My eventual goal is for all of the code in `page_filter.rs` that converts `Index` to `ArrayRef` to be gone, so that `page_filter.rs` only calls `StatisticsConverter`. To avoid a massive PR, however, I think it makes sense to add new code to `StatisticsConverter` for extracting page values and then, once it is complete enough, switch `page_filter.rs` to use `StatisticsConverter`.

> * we already iterate over each row_group individually, extracting a single Option<&Index> [here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L169-L196) and passing it into `prune_pages_per_one_row_group`

Indeed, that is how it works today (one row group at a time). I eventually hope/plan to apply the same treatment to data page filtering as I did to row group filtering in https://github.com/apache/datafusion/pull/10802 (that is, make a single call to `PruningPredicate::prune` for all the remaining row groups).

> Now, my API has to change. I'm wondering how specific it should be? If we pass `&Index` as a parameter, I can match the index and extract the statistic as done [here](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L394-L567). However, I'm not sure this is the way to go. We'd simply move the `get_min_max_values_for_page_index` method, and basically have no need for the StatisticConverter?

Let me play around with some options and get back to you.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org