marvinlanhenke commented on issue #10806: URL: https://github.com/apache/datafusion/issues/10806#issuecomment-2156778662
@alamb I'm currently thinking about how to integrate `StatisticsConverter` with the existing code `prune_pages_in_one_row_group`. This is what I originally had in mind for the converter method: ```Rust pub fn column_index_mins(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> { let data_type = self.arrow_field.data_type(); let Some(parquet_column_index) = metadata.column_index() else { return Ok(self.make_null_array(data_type, metadata.row_groups())); }; let Some(parquet_index) = self.parquet_index else { return Ok(self.make_null_array(data_type, metadata.row_groups())); }; let row_group_page_indices = parquet_column_index .into_iter() .map(|x| x.get(parquet_index)); min_page_statistics(Some(data_type), row_group_page_indices) } ``` So we would simply create an iterator for all row groups column indices, match the index and apply the stats_func. Which works - all tests are passing. However, the API, or the integration with `prune_pages_in_one_row_group` feels kind of strange: - a lot of work the StatisticConverter does is already done [here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L155-L187) - we already iterate over each row_group individually, extracting a single Option<&Index> [here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L169-L196) and passing it into `prune_pages_per_one_row_group` Now, my API has to change. I'm wondering how specific it should be? If we pass `&Index` as a parameter, I can match the index and extract the statistic as done [here](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L394-L567). However, I'm not sure this is the way to go. We'd simply move the `get_min_max_values_for_page_index` method, and basically have no need for the StatisticConverter? Maybe I'm missing something, but I think it would help to maybe outline the scope of the refactor you had in mind. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org