alamb opened a new issue, #16402:
URL: https://github.com/apache/datafusion/issues/16402

   ### Is your feature request related to a problem or challenge?
   
   - This is a follow-on to the feature added by @adriangb in 
https://github.com/apache/datafusion/pull/16014
   
   @adriangb added a great feature that can prune entire files while opening 
many Parquet files.
   
   The current statistics for `DataSourceExec` report how many row groups 
were pruned. It would also be great to add statistics on how many 
**FILES** were pruned by this new code.
   
   For example, with ClickBench Q24, here is an excerpt from the `EXPLAIN ANALYZE` output:
   
   ```sql
   EXPLAIN ANALYZE SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' 
ORDER BY "SearchPhrase" LIMIT 10;
   ```
   
   
   ```
   |                   |         DataSourceExec:...
    pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, 
row_groups_matched_statistics=325, row_groups_pruned_bloom_filter=0, 
row_groups_pruned_statistics=0
   ```
   
   
   ### Describe the solution you'd like
   
   I would like some new statistics that record:
   * `files_pruned`: total files that were pruned by filters during open
   
   It is important that the docs explain that the metric only counts 
files pruned after the plan starts executing (not files that are pruned 
during planning).
   
   
   
   ### Describe alternatives you've considered
   
   1. Add a field to `ParquetFileMetrics`: 
https://github.com/apache/datafusion/blob/6d5e00ad3f8e53f7252cb1d3c72a6c7f28c1aed6/datafusion/datasource-parquet/src/metrics.rs#L29-L28
   2. Thread that through to the opener in 
`datafusion/datasource-parquet/src/opener.rs` so when files are pruned we can 
see that in the metrics
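   
   The two steps above could look roughly like the sketch below. It is 
standalone and uses only the standard library: `FileMetricsSketch`, 
`FileStats`, `open_file` and `count_pruned` are hypothetical names made up for 
illustration, not the actual `ParquetFileMetrics` or opener API, and the 
min/max check is only a stand-in for the real pruning logic.
   
   ```rust
   use std::sync::atomic::{AtomicUsize, Ordering};
   use std::sync::Arc;

   /// Hypothetical stand-in for `ParquetFileMetrics`; only the proposed
   /// counter is modeled here.
   #[derive(Debug, Default, Clone)]
   struct FileMetricsSketch {
       /// Files skipped by filters while opening, i.e. after execution
       /// has started. Files removed during planning never reach the
       /// opener, so they are not counted.
       files_pruned: Arc<AtomicUsize>,
   }

   /// Hypothetical per-file statistics: max value of the filtered column.
   struct FileStats {
       max: i64,
   }

   /// Hypothetical opener step for the filter `col > threshold`.
   /// Returns `true` if the file must be scanned, `false` if it was pruned.
   fn open_file(stats: &FileStats, threshold: i64, metrics: &FileMetricsSketch) -> bool {
       if stats.max <= threshold {
           // No row can match, so skip the file and record that in the metrics.
           metrics.files_pruned.fetch_add(1, Ordering::Relaxed);
           return false;
       }
       true
   }

   /// Runs the opener over every file and returns the `files_pruned` count.
   fn count_pruned(files: &[FileStats], threshold: i64) -> usize {
       let metrics = FileMetricsSketch::default();
       let _scanned = files
           .iter()
           .filter(|f| open_file(f, threshold, &metrics))
           .count();
       metrics.files_pruned.load(Ordering::Relaxed)
   }

   fn main() {
       let files = [
           FileStats { max: 10 },
           FileStats { max: 50 },
           FileStats { max: 90 },
       ];
       println!("files_pruned={}", count_pruned(&files, 20));
   }
   ```
   
   The counter is shared through an `Arc` in this sketch so that clones of the 
metrics handed to different openers would add to the same total.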
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

