2010YOUY01 opened a new issue, #18195: URL: https://github.com/apache/datafusion/issues/18195
### Is your feature request related to a problem or challenge?

While working on setting the default metrics in the parquet scanner in https://github.com/apache/datafusion/issues/18116, I came up with several ideas to further improve the metrics accounting in `EXPLAIN ANALYZE` for the parquet scanner.

- [ ] Support a new metric value `files_ranges_matched_statistics`
- [ ] Add a new metric value `scan_efficiency_ratio`
- [ ] Fix `elapsed_compute` baseline metrics not counting issue
- [ ] Add a new metric type for the general pruning-related metrics

### Support a new metric value `files_ranges_matched_statistics`

There is an existing metric `files_ranges_pruned_statistics`:
https://github.com/apache/datafusion/blob/155b56e521d75186776a65f1634ee03058899a79/datafusion/datasource-parquet/src/metrics.rs#L44

It would be good to also display how many file ranges are matched, to make the output more comprehensive and consistent with the existing row-group and page level metrics.

### Add a new metric value `scan_efficiency_ratio`

I think it would be helpful to track:

```
scan_efficiency_ratio -- bytes_scanned / total_file_size, as a quick insight for the overall pruning effectiveness
```

### Fix `elapsed_compute` baseline metrics not counting issue

It seems the `elapsed_compute` baseline metric is currently not tracked. To reproduce, run any whole-file scan on a parquet source in `datafusion-cli`; the reported value is unrealistically low:

```
DataFusion CLI v50.2.0
> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
STORED AS parquet
LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';
0 row(s) fetched.
Elapsed 0.049 seconds.

> explain analyze select * from lineitem;
...elapsed_compute = 14ns...
```

### Add a new metric type for the general pruning-related metrics

There are many levels of pruning inside the parquet scanner: file range, row group statistics, row group bloom filter, page index, ...
The result of each level is currently displayed as a pair of separate metrics, like `row_groups_matched_statistics=3, row_groups_pruned_statistics=7`.

I think displaying it as `row_groups_statistics_pruning=10 total -> 3 matched` looks better, and it would make the lengthy existing metrics output more concise. To do this, we can add a new metric value type and change its display implementation.

### Describe the solution you'd like

_No response_

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
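
To make the two proposals above more concrete, here is a rough sketch of a combined pruning metric value and of the `scan_efficiency_ratio` computation. The type `PruningMetrics`, its methods, and the free function `scan_efficiency_ratio` are hypothetical names invented for illustration; they are not existing DataFusion APIs, and a real implementation would presumably use atomic counters and plug into the existing metric value type.

```rust
use std::fmt;

/// Hypothetical metric value that keeps both counters of one pruning level
/// (e.g. row-group statistics) together. Not an existing DataFusion type.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct PruningMetrics {
    /// Number of containers (file ranges, row groups, pages) skipped.
    pruned: usize,
    /// Number of containers that survived pruning and must be read.
    matched: usize,
}

impl PruningMetrics {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn add_pruned(&mut self, n: usize) {
        self.pruned += n;
    }

    pub fn add_matched(&mut self, n: usize) {
        self.matched += n;
    }

    /// Total number of containers considered at this pruning level.
    pub fn total(&self) -> usize {
        self.pruned + self.matched
    }
}

impl fmt::Display for PruningMetrics {
    /// Renders the concise form suggested above: "<total> total -> <matched> matched".
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} total -> {} matched", self.total(), self.matched)
    }
}

/// Hypothetical helper: bytes_scanned / total_file_size.
/// Returns `None` when the total size is zero, so callers never divide by zero.
pub fn scan_efficiency_ratio(bytes_scanned: u64, total_file_size: u64) -> Option<f64> {
    if total_file_size == 0 {
        None
    } else {
        Some(bytes_scanned as f64 / total_file_size as f64)
    }
}
```

With 3 matched and 7 pruned row groups, formatting the value as `row_groups_statistics_pruning={}` would yield the single entry proposed above in place of the two separate `matched`/`pruned` entries. Returning `Option` from the ratio helper is only one way to handle an empty scan; reporting `0` or omitting the metric would work as well.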
