jerrypeng commented on PR #56055:
URL: https://github.com/apache/spark/pull/56055#issuecomment-4598296739

   Thanks @mridulm for the question, and for your interest in this work!
   
   The short answer is that barrier execution mode and concurrent stage 
scheduling solve orthogonal problems. As I understand it, barrier mode is gang 
scheduling for the tasks within a single stage: it launches all N tasks of that 
stage simultaneously, and the tasks can then coordinate with each other 
mid-execution via barrier() / allGather() (MPI-style).
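   To make the gang-scheduling point concrete, here is a minimal sketch of the barrier()/allGather() coordination pattern using plain JVM threads and java.util.concurrent.CyclicBarrier. It is an analogy, not Spark's BarrierTaskContext: BarrierSketch and allGather are hypothetical names. It shows why every task of the stage has to be running at once. A task that reaches the barrier cannot proceed until all the others arrive.

```scala
import java.util.concurrent.CyclicBarrier

// Illustrative only: mimics the coordination pattern of a barrier stage with
// plain JVM threads. BarrierSketch and allGather are hypothetical names; this
// is not Spark's BarrierTaskContext.
object BarrierSketch {
  /** Runs numTasks "tasks" that each publish one value, then all read every value. */
  def allGather(numTasks: Int): Vector[Vector[String]] = {
    val slots   = new Array[String](numTasks)          // one published value per task
    val results = new Array[Vector[String]](numTasks)  // what each task saw after gathering
    // Every task must reach await() before any of them proceeds. If fewer than
    // numTasks tasks were running, await() would block forever, which is why
    // barrier mode has to launch all tasks of the stage at the same time.
    val barrier = new CyclicBarrier(numTasks)

    val tasks = (0 until numTasks).map { id =>
      new Thread(new Runnable {
        def run(): Unit = {
          slots(id) = s"task-$id"       // phase 1: publish this task's value
          barrier.await()               // wait until every task has published
          results(id) = slots.toVector  // phase 2: read every task's value
        }
      })
    }
    tasks.foreach(_.start()) // "gang-schedule": start all tasks together
    tasks.foreach(_.join())
    results.toVector
  }
}
```

   For example, BarrierSketch.allGather(3) gives each of the three tasks the same view, Vector("task-0", "task-1", "task-2"). CyclicBarrier guarantees that writes made before await() are visible to the other threads after they return from it, which is what makes the gathered values consistent.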
   
   What real-time mode (RTM) needs is different: the ability to schedule 
multiple stages of a job to run concurrently (which is what this PR focuses 
on), so that records can stream from upstream stages to downstream stages 
through a streaming shuffle. There's no hard requirement for all tasks to 
coordinate, or to be co-scheduled, before the query starts.
   
   Your question, whether RTM could benefit from gang scheduling, is a fair 
one. I think the answer is "maybe, but it's not strictly necessary." The 
streaming shuffle implements a backpressure mechanism that serves a similar 
purpose: if a downstream consumer isn't ready yet, the upstream producer backs 
off rather than failing. A coordinated execution mechanism like barrier 
scheduling is therefore not needed.
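   The backpressure idea can be sketched with two concurrently running threads standing in for an upstream and a downstream stage, connected by a bounded java.util.concurrent.ArrayBlockingQueue. This is a generic producer/consumer illustration under stated assumptions, not the streaming shuffle's actual implementation: BackpressureSketch, pipeline, and the EndOfStream sentinel are hypothetical. The consumer deliberately starts late, and the producer simply blocks on the full buffer instead of failing.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

// Illustrative only: BackpressureSketch, pipeline and EndOfStream are
// hypothetical names, not part of Spark or of the streaming shuffle.
object BackpressureSketch {
  private val EndOfStream = -1 // sentinel; assumes all real records are >= 0

  /** Streams records from an upstream thread to a downstream thread through a bounded buffer. */
  def pipeline(records: Seq[Int], capacity: Int): List[Int] = {
    val buffer   = new ArrayBlockingQueue[Int](capacity)
    val consumed = ArrayBuffer.empty[Int]

    val producer = new Thread(new Runnable {
      def run(): Unit = {
        // put() blocks while the buffer is full: the producer backs off
        // instead of failing when the consumer falls behind or is not ready.
        records.foreach(r => buffer.put(r))
        buffer.put(EndOfStream)
      }
    })
    val consumer = new Thread(new Runnable {
      def run(): Unit = {
        var r = buffer.take() // blocks until a record is available
        while (r != EndOfStream) {
          consumed += r
          Thread.sleep(1)     // simulate a slower downstream stage
          r = buffer.take()
        }
      }
    })

    producer.start()
    Thread.sleep(20)  // downstream starts late; the producer just waits on the full buffer
    consumer.start()  // both stages now run concurrently
    producer.join()
    consumer.join()
    consumed.toList
  }
}
```

   For example, BackpressureSketch.pipeline(1 to 20, capacity = 4) returns all 20 records in order, even though the producer fills the buffer before the consumer exists. Unlike the barrier sketch, no thread waits for all the others to be launched; each side only waits on the shared buffer.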
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

