devanbenz commented on issue #12088:
URL: https://github.com/apache/datafusion/issues/12088#issuecomment-2307291595

   > A logical plan is relatively easy to understand, physical plans are 
definitely hard to understand, because they include the execution detail for 
exchange-based parallelism
   > 
   > For example, a multi-stage aggregation's physical plan looks like
   > 
   > ```
   > |               |     AggregateExec: mode=FinalPartitioned, gby=[city@0 as 
city], aggr=[COUNT(Int64(1))]                                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                               |
   > |               |       CoalesceBatchesExec: target_batch_size=8192        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                               |
   > |               |         RepartitionExec: partitioning=Hash([city@0], 4), 
input_partitions=4                                                              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                               |
   > |               |           AggregateExec: mode=Partial, gby=[city@0 as 
city], aggr=[COUNT(Int64(1))]                                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                  |
   > ```
   > 
   > It appears every executor node has one input and one output In reality, 
one executor can have multiple instances being spawned according to the 
partition number, and different executor runtime instance can have one input 
stream + multiple output stream, one input + one output, etc.
   > 
   > The tricky thing is the data flow in execution is an "expanded" physical 
plan tree, which should be visualized with some imagination, instead of being a 
1-to-1 mapping of the physical plan
   > 
   > I would recommend first explain what is exchange based parallelism with a 
concrete example (looks like muti-staged parallel aggregation is a good one) 
Then specify the behavior of different executors (how they split/merge data 
stream)
   
   @2010YOUY01 I'm personally not very familiar with exchange based 
parallelism. Could you point me in the direction of a good paper/resource on 
the topic. Assuming `How Query Engines Work` has a good section on it: 
https://howqueryengineswork.com/12-parallel-query.html 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to