alamb opened a new pull request, #10802:
URL: https://github.com/apache/datafusion/pull/10802

   ## Which issue does this PR close?
   
   
   Part of #10453 and #9929
   
   Follow on to https://github.com/apache/datafusion/pull/10607
   
   ## Rationale for this change
   
   The primary benefit of this PR is to start using the new API introduced in 
https://github.com/apache/datafusion/pull/10537 in the `ParquetExec` path. I 
plan a follow on project to use the same basic API to extract and prune pages 
within row groups.
   
   The current `ParquetExec` prunes one row group at a time by creating one-row 
`ArrayRef`s for each required min/max/count. It would be better to create a 
single array with the data for multiple row groups and make a single call to the 
vectorized pruning that `PruningPredicate` does. 
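
To illustrate the difference, here is a minimal sketch in plain Rust. It is not the DataFusion implementation: `RowGroupStats`, `keep_one`, `to_columns`, and `keep_batch` are hypothetical names, the statistics are plain `i64` values rather than Arrow arrays, and the predicate is fixed to `col = value`.

```rust
/// Hypothetical min/max statistics for one column in one row group.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Per-row-group style: one call per row group.
/// Returns true if the row group may contain rows where `col = value`.
fn keep_one(stats: RowGroupStats, value: i64) -> bool {
    stats.min <= value && value <= stats.max
}

/// Gather the statistics of many row groups into one column of mins
/// and one column of maxes (stand-ins for multi-row arrays).
fn to_columns(groups: &[RowGroupStats]) -> (Vec<i64>, Vec<i64>) {
    groups.iter().map(|g| (g.min, g.max)).unzip()
}

/// Batched style: evaluate the predicate once over whole columns and
/// return one keep/prune flag per row group.
fn keep_batch(mins: &[i64], maxes: &[i64], value: i64) -> Vec<bool> {
    mins.iter()
        .zip(maxes)
        .map(|(&min, &max)| min <= value && value <= max)
        .collect()
}
```

Both styles produce the same keep/prune decisions. The batched form builds the statistics columns and evaluates the predicate once per file rather than once per row group.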
   
   We recently made a similar change in InfluxDB IOx and saw a significant 
performance improvement for queries that accessed many row groups.
   
   I expect this to be a performance improvement, but I am not sure it will be 
measurable unless there are an extremely large number of row groups in a file.
   
   ## What changes are included in this PR?
   
   1. Call `PruningPredicate::prune` once per file (rather than once per row 
group)
   2. Switch to using the `StatisticsExtractor` API introduced in 
https://github.com/apache/datafusion/pull/10537
   3. Update the `StatisticsExtractor` API so it extracts a specified set of 
row groups rather than all of them
   
   
   The changes to the `StatisticsExtractor` API return min/max statistics 
from separate functions rather than selecting them with an enum. This allows 
re-matching the relevant fields and using the same basic API to extract 
min/max statistics for pages (`page_mins()`, `page_maxs()`, `page_counts()`, 
etc.).
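
As a rough sketch of that API shape (one function per statistic, each taking the row groups to extract), consider the following. It is not the actual `StatisticsExtractor` signature: `RowGroupMeta`, `Extractor`, `mins`, `maxes`, and `null_counts` are hypothetical names, and plain `Vec`s stand in for Arrow arrays.

```rust
/// Hypothetical per-row-group statistics for a single column.
struct RowGroupMeta {
    min: Option<i64>,
    max: Option<i64>,
    null_count: u64,
}

/// Hypothetical extractor over the row group metadata of one file.
struct Extractor<'a> {
    row_groups: &'a [RowGroupMeta],
}

impl<'a> Extractor<'a> {
    /// Minimum values for the selected row groups, in the order given.
    /// Panics if an index is out of bounds.
    fn mins(&self, indices: &[usize]) -> Vec<Option<i64>> {
        indices.iter().map(|&i| self.row_groups[i].min).collect()
    }

    /// Maximum values for the selected row groups, in the order given.
    fn maxes(&self, indices: &[usize]) -> Vec<Option<i64>> {
        indices.iter().map(|&i| self.row_groups[i].max).collect()
    }

    /// Null counts for the selected row groups, in the order given.
    fn null_counts(&self, indices: &[usize]) -> Vec<u64> {
        indices.iter().map(|&i| self.row_groups[i].null_count).collect()
    }
}
```

With separate functions, each statistic can have its own return type, and a page-level variant could follow the same pattern with page indices in place of row group indices.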
   
   
   ## Are these changes tested?
   Covered by existing CI tests.
   
   I will also run benchmark tests.
   
   ## Are there any user-facing changes?
   The `StatisticsExtractor` API has changed, but since this API has not yet 
been released, this is not strictly a breaking API change.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: github-h...@datafusion.apache.org