jerrypeng commented on PR #56055: URL: https://github.com/apache/spark/pull/56055#issuecomment-4598296739
Thanks @mridulm for the question, and for your interest in this work!

The short answer is that barrier execution mode and concurrent stage scheduling solve orthogonal problems. As I understand it, barrier mode is gang scheduling for the tasks within a single stage: it launches all N tasks of that stage simultaneously, and the tasks can then coordinate with each other mid-execution via barrier() / allGather() (MPI-style).

What real-time mode (RTM) needs is different: the ability to schedule multiple stages of a job to run concurrently (which is what this PR focuses on), so that records can stream from upstream stages to downstream stages through a streaming shuffle. There is no hard requirement for all tasks to coordinate, or to be co-scheduled, before the query starts.

Your question, whether RTM could benefit from gang scheduling, is a fair one. I think the answer is "maybe, but not strictly necessary." The streaming shuffle implements a backpressure mechanism that serves a similar purpose: if a downstream consumer isn't ready yet, the upstream producer backs off rather than failing, so a coordinated execution mechanism such as barrier scheduling is not needed.
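
To make the backpressure point concrete, here is a minimal, self-contained sketch in plain Python. It is not the streaming shuffle implementation from this PR; it only illustrates the general idea with a bounded in-memory queue. The names `run_pipeline`, `buffer_size`, and `consumer_start_delay` are hypothetical and chosen for illustration. The producer starts immediately, the consumer starts late, and the producer simply blocks when the buffer is full instead of failing.

```python
import queue
import threading
import time


def run_pipeline(num_records=20, buffer_size=4, consumer_start_delay=0.05):
    """Run an upstream producer and a late-starting downstream consumer.

    Illustration only: a bounded queue stands in for a streaming shuffle
    buffer. The two sides are not co-scheduled and never call a barrier.
    """
    # Bounded buffer: put() blocks once buffer_size items are waiting.
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks end of stream
    received = []

    def producer():
        for record in range(num_records):
            # Backpressure: blocks (backs off) while the buffer is full,
            # rather than raising because the consumer is not ready.
            buffer.put(record)
        buffer.put(sentinel)

    def consumer():
        # Simulate a downstream stage that becomes ready later.
        time.sleep(consumer_start_delay)
        while True:
            record = buffer.get()
            if record is sentinel:
                break
            received.append(record)

    upstream = threading.Thread(target=producer)
    downstream = threading.Thread(target=consumer)
    upstream.start()    # upstream starts first, without waiting
    downstream.start()  # downstream joins whenever it is ready
    upstream.join()
    downstream.join()
    return received


if __name__ == "__main__":
    records = run_pipeline()
    print(len(records), "records delivered in order:", records == list(range(20)))
```

The design point is that neither side needs to know when the other starts: the bounded buffer absorbs the mismatch, which is the role I'm attributing to backpressure above.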
