alamb opened a new pull request, #10802: URL: https://github.com/apache/datafusion/pull/10802
## Which issue does this PR close?

Part of #10453 and #9929

Follow on to https://github.com/apache/datafusion/pull/10607

## Rationale for this change

The primary benefit of this PR is to start using the new API introduced in https://github.com/apache/datafusion/pull/10537 in the `ParquetExec` path. I plan a follow-on project to use the same basic API to extract and prune pages within row groups.

The current `ParquetExec` prunes one row group at a time by creating 1-row `ArrayRef`s for each required min/max/count statistic. It would be better to create a single array with the data for multiple row groups and make a single call to the vectorized pruning that `PruningPredicate` does.

We recently made a similar change in InfluxDB IOx and saw a significant performance improvement for queries that accessed many row groups.

I expect this to be a performance improvement, but I am not sure it will be measurable unless there are an extremely large number of row groups in a file.

## What changes are included in this PR?

1. Call `PruningPredicate::prune` once per file (rather than once per row group)
2. Switch to use the `StatisticsExtractor` API introduced in https://github.com/apache/datafusion/pull/10537
3. Update the `StatisticsExtractor` API so it extracts a specified set of row groups rather than all of them

The changes to the `StatisticsExtractor` API are to return min/max statistics by different functions rather than by an enum. This allows re-matching the relevant fields as well as using the same basic API to extract min/max statistics for pages (`page_mins()`, `page_maxs()`, `page_counts()`, etc.).

## Are these changes tested?

Covered by existing CI tests. I will also run benchmark tests.

## Are there any user-facing changes?

The `StatisticsExtractor` API has changed, but since this API has not yet been released, this is not strictly a breaking API change.

<!-- If there are any breaking changes to public APIs, please add the `api change` label.
-->
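
As a rough illustration of the rationale above (gather statistics for a chosen set of row groups into one array per statistic, then evaluate the pruning predicate once per file), here is a minimal, dependency-free sketch. `RowGroupStats`, `StatsColumns`, `extract_stats`, and `prune_equals` are hypothetical names invented for this example; they are not the `StatisticsExtractor` or `PruningPredicate` APIs, and plain `Vec`s stand in for Arrow arrays.

```rust
/// Hypothetical per-row-group statistics for a single i64 column
/// (a stand-in for Parquet row group metadata, not a DataFusion type).
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Columnar view of the statistics: one entry per selected row group.
struct StatsColumns {
    mins: Vec<i64>,
    maxs: Vec<i64>,
}

/// Gather min/max values for the requested row groups into two flat vectors,
/// analogous to building one array per statistic instead of many 1-row arrays.
fn extract_stats(all: &[RowGroupStats], row_groups: &[usize]) -> StatsColumns {
    StatsColumns {
        mins: row_groups.iter().map(|&i| all[i].min).collect(),
        maxs: row_groups.iter().map(|&i| all[i].max).collect(),
    }
}

/// Evaluate the predicate `col = value` for every selected row group in one pass.
/// `true` means the row group may contain matching rows and must be read;
/// `false` means it can be skipped.
fn prune_equals(stats: &StatsColumns, value: i64) -> Vec<bool> {
    stats
        .mins
        .iter()
        .zip(stats.maxs.iter())
        .map(|(&min, &max)| min <= value && value <= max)
        .collect()
}

fn main() {
    // Assumed example metadata: three row groups with disjoint value ranges.
    let metadata = [
        RowGroupStats { min: 0, max: 9 },
        RowGroupStats { min: 10, max: 19 },
        RowGroupStats { min: 20, max: 29 },
    ];

    // One extraction and one pruning call for the whole file.
    let stats = extract_stats(&metadata, &[0, 1, 2]);
    let keep = prune_equals(&stats, 15);
    println!("{:?}", keep); // prints "[false, true, false]"
}
```

Passing the row group indices explicitly mirrors item 3 above: a caller that has already ruled out some row groups (for example by file range) only pays for the statistics it still needs.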