2010YOUY01 opened a new issue, #18116:
URL: https://github.com/apache/datafusion/issues/18116

   ### Is your feature request related to a problem or challenge?
   
   Previously, by default, `EXPLAIN ANALYZE` showed all available metrics 
of an operator. This can get quite verbose for some operators.
   
   In `datafusion-cli`:
   ```
   > CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
   STORED AS parquet
   LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   > explain analyze select *
   from lineitem
   where l_orderkey = 3000000;
   ```
   
   The parquet reader includes a large number of low-level details:
   ```
   metrics=[output_rows=19813, elapsed_compute=14ns, batches_split=0, 
bytes_scanned=2147308, file_open_errors=0, file_scan_errors=0, 
files_ranges_pruned_statistics=18, num_predicate_creation_errors=0, 
page_index_rows_matched=19813, page_index_rows_pruned=729088, 
predicate_cache_inner_records=0, predicate_cache_records=0, 
predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, 
row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, 
row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, 
bloom_filter_eval_time=21.997µs, metadata_load_time=273.83µs, 
page_index_eval_time=29.915µs, row_pushdown_eval_time=42ns, 
statistics_eval_time=76.248µs, time_elapsed_opening=4.02146ms, 
time_elapsed_processing=24.787461ms, time_elapsed_scanning_total=24.17671ms, 
time_elapsed_scanning_until_data=23.103665ms]
   ```
   
   I believe only a subset of them is commonly used, for example `output_rows`, 
`metadata_load_time`, and how many files/row groups/pages are pruned, and it would 
be better to display only the most common ones by default.
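
To illustrate the intended effect, here is a minimal Python sketch that trims a `metrics=[...]` list down to a chosen subset. It is only a text-level illustration, not how DataFusion filters metrics internally; the function name `filter_metrics` and the set `COMMON_METRICS` are hypothetical, and the real choice of "common" metrics is exactly what this issue asks to investigate.

```python
import re

# Hypothetical subset, picked from the examples named above; not an agreed list.
COMMON_METRICS = {"output_rows", "metadata_load_time", "page_index_rows_pruned"}


def filter_metrics(line, keep):
    """Rewrite the first metrics=[...] list in `line`, keeping only names in `keep`."""
    match = re.search(r"metrics=\[(.*?)\]", line, flags=re.DOTALL)
    if match is None:
        # No metrics list on this line: return it unchanged.
        return line
    # Metrics are comma-separated `name=value` pairs.
    pairs = [p.strip() for p in match.group(1).split(",") if p.strip()]
    kept = [p for p in pairs if p.split("=", 1)[0] in keep]
    return line[:match.start()] + "metrics=[" + ", ".join(kept) + "]" + line[match.end():]


if __name__ == "__main__":
    verbose = (
        "metrics=[output_rows=19813, elapsed_compute=14ns, batches_split=0, "
        "page_index_rows_pruned=729088, metadata_load_time=273.83us]"
    )
    print(filter_metrics(verbose, COMMON_METRICS))
```

Applied to the parquet reader output above, this would leave only a few entries instead of nearly thirty.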
   
   After https://github.com/apache/datafusion/pull/18098, the `EXPLAIN ANALYZE` 
detail level can be controlled through an option:
   ```
   > set datafusion.explain.analyze_level = summary;
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   > explain analyze select * from generate_series(10000) as t1(v1) order by v1 
desc;
   
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                   
                                                                                
                      |
   
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | SortExec: expr=[v1@0 DESC], 
preserve_partitioning=[false], metrics=[output_rows=10001, 
elapsed_compute=100µs]                                                |
   |                   |   ProjectionExec: expr=[value@0 as v1], 
metrics=[output_rows=10001, elapsed_compute=1.166µs]                            
                                     |
   |                   |     LazyMemoryExec: partitions=1, 
batch_generators=[generate_series: start=0, end=10000, batch_size=8192], 
metrics=[output_rows=10001, elapsed_compute=43µs] |
   |                   |                                                        
                                                                                
                      |
   
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
   1 row(s) fetched.
   Elapsed 0.001 seconds.
   
   > set datafusion.explain.analyze_level = dev;
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   > explain analyze select * from generate_series(10000) as t1(v1) order by v1 
desc;
   
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                   
                                                                                
                                                |
   
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | SortExec: expr=[v1@0 DESC], 
preserve_partitioning=[false], metrics=[output_rows=10001, 
elapsed_compute=222.043µs, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, 
batches_split=2] |
   |                   |   ProjectionExec: expr=[value@0 as v1], 
metrics=[output_rows=10001, elapsed_compute=2.584µs]                            
                                                               |
   |                   |     LazyMemoryExec: partitions=1, 
batch_generators=[generate_series: start=0, end=10000, batch_size=8192], 
metrics=[output_rows=10001, elapsed_compute=162.625µs]                      |
   |                   |                                                        
                                                                                
                                                |
   
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   1 row(s) fetched.
   Elapsed 0.003 seconds.
   ```
   
   Now only `output_rows` and `elapsed_compute` are included in the `summary` 
analyze level, and the `dev` level is the default.
   
   The goal of this issue is, for each operator:
   - Investigate which metrics are most commonly used for high-level insights; 
the output of PostgreSQL and DuckDB can be used as a reference
   - Set those common metrics to `summary` type
   
   Finally, make `summary` the default analyze level.
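
The relationship between a metric's type and the analyze level can be sketched as follows. This is a hypothetical Python model, not DataFusion's Rust API: the names `Level`, `Metric`, and `visible_metrics` are invented for illustration. It assumes, as the example output above suggests, that the `dev` level shows `summary` metrics plus everything else, and that a metric stays dev-only until it is explicitly promoted.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical ordering: a higher level shows more metrics."""
    SUMMARY = 0
    DEV = 1


@dataclass(frozen=True)
class Metric:
    name: str
    value: str
    # Assumption: metrics are dev-only unless promoted to summary.
    level: Level = Level.DEV


def visible_metrics(metrics, analyze_level):
    """Return the metrics whose level is at or below the requested analyze level."""
    return [m for m in metrics if m.level <= analyze_level]


if __name__ == "__main__":
    sort_exec = [
        Metric("output_rows", "10001", Level.SUMMARY),
        Metric("elapsed_compute", "100us", Level.SUMMARY),
        Metric("spill_count", "0"),
        Metric("spilled_rows", "0"),
    ]
    for level in (Level.SUMMARY, Level.DEV):
        names = [m.name for m in visible_metrics(sort_exec, level)]
        print(level.name, names)
```

Under this model, the per-operator work in this issue amounts to deciding which metrics get promoted to the summary type.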
   
   ### Describe the solution you'd like
   
   Identify which metrics should be included in the `summary` level for all 
operators.
   
   - [ ] DataSourceExec
   - [ ] DataSinkExec
   - [ ] AnalyzeExec
   - [ ] AsyncFuncExec
   - [ ] AggregateExec
   - [ ] CoalesceBatchesExec
   - [ ] CoalescePartitionsExec
   - [ ] CooperativeExec
   - [ ] CrossJoinExec
   - [ ] EmptyExec
   - [ ] ExplainExec
   - [ ] FilterExec
   - [ ] GlobalLimitExec
   - [ ] LocalLimitExec
   - [ ] HashJoinExec
   - [ ] NestedLoopJoinExec
   - [ ] SortMergeJoinExec
   - [ ] SymmetricHashJoinExec
   - [ ] PlaceholderRowExec
   - [ ] ProjectionExec
   - [ ] RecursiveQueryExec
   - [ ] RepartitionExec
   - [ ] SortExec
   - [ ] PartialSortExec
   - [ ] SortPreservingMergeExec
   - [ ] StreamingTableExec
   - [ ] UnionExec
   - [ ] InterleaveExec
   - [ ] UnnestExec
   - [ ] WindowAggExec
   - [ ] BoundedWindowAggExec
   - [ ] WorkTableExec
   - [ ] LazyMemoryExec
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

