Re: [I] Document "how to read an explain plan" [datafusion]

via GitHub Tue, 20 Aug 2024 20:14:32 -0700


2010YOUY01 commented on issue #12088:
URL: https://github.com/apache/datafusion/issues/12088#issuecomment-2300489412


   A logical plan is relatively easy to understand, physical plans are 
definitely hard to understand, because they include the execution detail for 
exchange-based parallelism
   
   For example, a multi-stage aggregation's physical plan looks like
   ```                                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                  |
   |               |     AggregateExec: mode=FinalPartitioned, gby=[city@0 as 
city], aggr=[COUNT(Int64(1))]                                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                               |
   |               |       CoalesceBatchesExec: target_batch_size=8192          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                             |
   |               |         RepartitionExec: partitioning=Hash([city@0], 4), 
input_partitions=4                                                              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                               |
   |               |           AggregateExec: mode=Partial, gby=[city@0 as 
city], aggr=[COUNT(Int64(1))]                                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                  |
   ```
   It appears every executor node has one input and one output
   In reality, one executor can have multiple instances being spawned according 
to the partition number, and different executor runtime instance can have one 
input stream + multiple output stream, one input + one output, etc.
   
   The tricky thing is the data flow in execution is an "expanded" physical 
plan tree, which should be visualized with some imagination, instead of being a 
1-to-1 mapping of the physical plan
   
   I would recommend first explain what is exchange based parallelism with a 
concrete example (looks like muti-staged parallel aggregation is a good one)
   Then specify the behavior of different executors (how they split/merge data 
stream)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Document "how to read an explain plan" [datafusion]

Reply via email to