[I] Improve metrics in `DataSourceExec` with `Parquet` source [datafusion]

via GitHub Tue, 21 Oct 2025 03:44:24 -0700


2010YOUY01 opened a new issue, #18195:
URL: https://github.com/apache/datafusion/issues/18195


   ### Is your feature request related to a problem or challenge?
   
   While working on setting the default metrics in the parquet scanner in 
https://github.com/apache/datafusion/issues/18116, I came up with several ideas to 
further improve the metrics accounting in `EXPLAIN ANALYZE` for the parquet 
scanner.
   
   - [ ] Support a new metric value `files_ranges_matched_statistics`
   - [ ] Add a new metric value `scan_efficiency_ratio`
   - [ ] Fix `elapsed_compute` baseline metrics not counting issue
   - [ ] Add a new metric type for the general pruning-related metrics
   
   ### Support a new metric value `files_ranges_matched_statistics`
   There is an existing metric, `files_ranges_pruned_statistics`:
   
https://github.com/apache/datafusion/blob/155b56e521d75186776a65f1634ee03058899a79/datafusion/datasource-parquet/src/metrics.rs#L44
   It would also be good to display how many file ranges are matched, to make 
the output more comprehensive, similar to the existing row-group/page level metrics.
   
   ### Add a new metric value `scan_efficiency_ratio`
   I think it would be helpful to track:
   ```
   scan_efficiency_ratio -- bytes_scanned / total_file_size, as a quick insight 
for the overall pruning effectiveness
   ```
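   
   A minimal sketch of how such a ratio could be computed, assuming the two 
   byte counts are already available as plain integers. The function names 
   below are hypothetical and are not part of DataFusion:
   
   ```rust
   /// Hypothetical helper: fraction of the file bytes that the scan read.
   /// Returns `None` when the total size is zero, to avoid dividing by zero.
   fn scan_efficiency_ratio(bytes_scanned: u64, total_file_size: u64) -> Option<f64> {
       if total_file_size == 0 {
           None
       } else {
           Some(bytes_scanned as f64 / total_file_size as f64)
       }
   }
   
   /// Hypothetical formatter: render the ratio as a percentage for display.
   fn format_ratio(ratio: Option<f64>) -> String {
       match ratio {
           Some(r) => format!("{:.2}%", r * 100.0),
           None => "N/A".to_string(),
       }
   }
   ```
   
   A lower value would mean that pruning skipped more of the file. Whether to 
   show a fraction or a percentage is a display choice left open here.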
   
   ### Fix `elapsed_compute` baseline metrics not counting issue
   It seems the `elapsed_compute` baseline metric is currently not tracked. To 
reproduce, run any whole-file scan on a parquet source in `datafusion-cli`; the 
metric will be unrealistically low:
   ```
   DataFusion CLI v50.2.0
   > CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
   STORED AS parquet
   LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';
   0 row(s) fetched.
   Elapsed 0.049 seconds.
   
   > explain analyze
   select * from lineitem;
   
   ...elapsed_compute = 14ns...
   ```
   
   ### Add a new metric type for the general pruning-related metrics
   There are many levels of pruning inside the parquet scanner: file range, row 
group statistics, row group bloom filter, page index, ...
   They are currently displayed like `row_groups_matched_statistics=3, 
row_groups_pruned_statistics=7`
   
   I think displaying it as `row_groups_statistics_pruning= 10 total -> 3 matched` 
looks better, and it would make the lengthy existing metrics output more concise.
   
   To do this, we can add a new metric value type and give it its own display 
implementation.
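   
   A rough, standalone sketch of what such a metric value could look like. The 
   `PruningCounts` type and its methods are hypothetical and only illustrate the 
   idea; they are not an existing DataFusion API:
   
   ```rust
   use std::fmt;
   use std::sync::atomic::{AtomicUsize, Ordering};
   
   /// Hypothetical combined metric: tracks both outcomes of one pruning step.
   #[derive(Debug, Default)]
   struct PruningCounts {
       pruned: AtomicUsize,
       matched: AtomicUsize,
   }
   
   impl PruningCounts {
       /// Record `n` containers (e.g. row groups) that were pruned.
       fn add_pruned(&self, n: usize) {
           self.pruned.fetch_add(n, Ordering::Relaxed);
       }
   
       /// Record `n` containers that survived pruning.
       fn add_matched(&self, n: usize) {
           self.matched.fetch_add(n, Ordering::Relaxed);
       }
   }
   
   impl fmt::Display for PruningCounts {
       /// Render as "<total> total -> <matched> matched".
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           let pruned = self.pruned.load(Ordering::Relaxed);
           let matched = self.matched.load(Ordering::Relaxed);
           write!(f, "{} total -> {} matched", pruned + matched, matched)
       }
   }
   ```
   
   With 7 pruned and 3 matched row groups, formatting this value produces 
   `10 total -> 3 matched`, so a single entry replaces the two separate counters.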
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

