alamb opened a new pull request #858:
URL: https://github.com/apache/arrow-datafusion/pull/858


   # Which issue does this PR close?
   Resolves https://github.com/apache/arrow-datafusion/issues/779
   
    # Rationale for this change
   `EXPLAIN PLAN` is great to understand what DataFusion plans to do, but it is 
hard today using the SQL or dataframe interface to understand in more depth 
what *actually happened* during execution.
   
   My real usecase is being able to see how many rows flowed through each 
operator as well as the  "time spent" and "rows produced" by each operator.
   
   # What changes are included in this PR?
   1. Add basic plan nodes for `EXPLAIN ANALYZE` and EXPLAIN ANALYZE VERBOSE` 
(example below) sql
   2. Refactor special case `ParquetStrream` into `RecordBatchReceiverStream` 
for reuse
   
   
   # Are there any user-facing changes?
   Yes, `EXPLAIN ANALYZE` now does something different than `EXPLAIN`
   
   ## Example of use
   ```shell
   echo "1,A" > /tmp/foo.csv
   echo "1,B" >> /tmp/foo.csv
   echo "2,A" >> /tmp/foo.csv
   ```
   
   Run CLI
   ```shell
   cargo run --bin datafusion-cli
   ```
   
   ```sql
   CREATE EXTERNAL TABLE foo(x INT, b VARCHAR) STORED AS CSV LOCATION 
'/tmp/foo.csv';
   ```
   
   ## Example `EXPLAIN ANALYZE` output
   
   ```sql
   > EXPLAIN ANALYZE SELECT SUM(x) FROM foo GROUP BY b;
   
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                   
                                                                                
                   |
   
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | CoalescePartitionsExec, metrics=[]                     
                                                                                
                   |
   |                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], 
metrics=[]                                                                      
                         |
   |                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 
as b], aggr=[SUM(x)], metrics=[outputRows=2]                                    
                   |
   |                   |       CoalesceBatchesExec: target_batch_size=4096, 
metrics=[]                                                                      
                       |
   |                   |         RepartitionExec: partitioning=Hash([Column { 
name: "b", index: 0 }], 16), metrics=[sendTime=839560, fetchTime=122528525, 
repartitionTime=5327877] |
   |                   |           HashAggregateExec: mode=Partial, gby=[b@1 as 
b], aggr=[SUM(x)], metrics=[outputRows=2]                                       
                   |
   |                   |             RepartitionExec: 
partitioning=RoundRobinBatch(16), metrics=[fetchTime=5660489, 
repartitionTime=0, sendTime=8012]                              |
   |                   |               CsvExec: source=Path(/tmp/foo.csv: 
[/tmp/foo.csv]), has_header=false, metrics=[]                                   
                         |
   
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   1 row in set. Query took 0.012 seconds.
   ```
   
   ## Example `EXPLAIN ANALYZE VERBOSE` output
   
   ```sql
   > EXPLAIN ANALYZE VERBOSE SELECT SUM(x) FROM foo GROUP BY b;
   
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                   
                                                                                
                   |
   
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | CoalescePartitionsExec, metrics=[]                     
                                                                                
                   |
   |                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], 
metrics=[]                                                                      
                         |
   |                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 
as b], aggr=[SUM(x)], metrics=[outputRows=2]                                    
                   |
   |                   |       CoalesceBatchesExec: target_batch_size=4096, 
metrics=[]                                                                      
                       |
   |                   |         RepartitionExec: partitioning=Hash([Column { 
name: "b", index: 0 }], 16), metrics=[repartitionTime=6584110, 
fetchTime=132927514, sendTime=904001] |
   |                   |           HashAggregateExec: mode=Partial, gby=[b@1 as 
b], aggr=[SUM(x)], metrics=[outputRows=2]                                       
                   |
   |                   |             RepartitionExec: 
partitioning=RoundRobinBatch(16), metrics=[repartitionTime=0, sendTime=8246, 
fetchTime=6239096]                              |
   |                   |               CsvExec: source=Path(/tmp/foo.csv: 
[/tmp/foo.csv]), has_header=false, metrics=[]                                   
                         |
   | Output Rows       | 2                                                      
                                                                                
                   |
   | Duration          | 10.283764ms                                            
                                                                                
                   |
   
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   3 rows in set. Query took 0.014 seconds.
   ```
   
   # Future work:
   Note this PR is designed just to hook up / plumb the existing code and 
metrics we have into SQL (basically what got added in 
https://github.com/apache/arrow-datafusion/pull/662). I plan a sequence of 
follow on PRs to both improve the metrics infrastructure 
https://github.com/apache/arrow-datafusion/issues/679 and add/fix the metrics 
that are actually reported so they are consistent. The specific metrics that 
are displayed are verbose and somewhat ad hoc at the moment.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to