2010YOUY01 commented on issue #12088: URL: https://github.com/apache/datafusion/issues/12088#issuecomment-2300489412
A logical plan is relatively easy to understand, physical plans are definitely hard to understand, because they include the execution detail for exchange-based parallelism For example, a multi-stage aggregation's physical plan looks like ``` | | | AggregateExec: mode=FinalPartitioned, gby=[city@0 as city], aggr=[COUNT(Int64(1))] | | | CoalesceBatchesExec: target_batch_size=8192 | | | RepartitionExec: partitioning=Hash([city@0], 4), input_partitions=4 | | | AggregateExec: mode=Partial, gby=[city@0 as city], aggr=[COUNT(Int64(1))] | ``` It appears every executor node has one input and one output In reality, one executor can have multiple instances being spawned according to the partition number, and different executor runtime instance can have one input stream + multiple output stream, one input + one output, etc. The tricky thing is the data flow in execution is an "expanded" physical plan tree, which should be visualized with some imagination, instead of being a 1-to-1 mapping of the physical plan I would recommend first explain what is exchange based parallelism with a concrete example (looks like muti-staged parallel aggregation is a good one) Then specify the behavior of different executors (how they split/merge data stream) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org