Re: [PR] [SPARK-57000][CORE][SS][RTM] Add concurrent scheduling capabilities for Real-time Mode [spark]

via GitHub Tue, 09 Jun 2026 14:47:42 -0700


jerrypeng commented on PR #56055:
URL: https://github.com/apache/spark/pull/56055#issuecomment-4664362614


   @mridulm thank you for the detailed feedback — I think we're aligned on the 
destination, and I'd like to propose reaching it incrementally.
   
   I agree the end state is: these scheduling semantics supported by the 
default DAGScheduler, with richer, more fine-grained abstractions — e.g. 
annotating in the query plan which shuffles can be read incrementally, rather 
than an opt-in flag. My question is whether we can sequence it into milestones 
rather than land it all at once.
   
   IMO this PR already declares clear semantics for the new scheduling 
capability, and they're fairly generic:
   
   1. The shuffle connecting two concurrent stages is read **incrementally**: 
the consumer reads from a still-running producer instead of waiting for fully 
materialized output.
   2. Because of that, stages with a data dependency can run **concurrently** 
rather than sequentially.
   3. Because that incremental shuffle is **transient** (its data can't be 
replayed), any task failure restarts the whole job.
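
   To make the three semantics concrete, here is a toy sketch in plain Python. It is not Spark code and does not reflect how the PR implements them: `TransientShuffle`, `run_job`, and `flaky_double` are hypothetical names, and an in-memory queue stands in for the shuffle. It only illustrates incremental reads, concurrent producer and consumer stages, and whole-job restart on any failure.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the producer's output


class TransientShuffle:
    """Hypothetical in-memory stand-in for an incrementally readable shuffle."""

    def __init__(self):
        self._q = queue.Queue()

    def write(self, record):
        self._q.put(record)

    def close(self):
        self._q.put(_DONE)

    def read(self):
        # Yields records as they arrive. A consumed record is gone for good,
        # which is why the data cannot be replayed after a failure.
        while True:
            record = self._q.get()
            if record is _DONE:
                return
            yield record


def run_job(records, consume, max_attempts=3):
    """Run producer and consumer stages concurrently; restart both on failure."""
    for attempt in range(1, max_attempts + 1):
        shuffle = TransientShuffle()  # fresh shuffle per attempt, nothing is reused
        results, errors = [], []

        def producer():
            for record in records:
                shuffle.write(record)
            shuffle.close()

        def consumer():
            try:
                # Reads start while the producer may still be writing.
                for record in shuffle.read():
                    results.append(consume(record, attempt))
            except Exception as exc:
                errors.append(exc)

        threads = [threading.Thread(target=producer),
                   threading.Thread(target=consumer)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        if not errors:
            return results, attempt
        # Any task failure discards partial results and reruns the whole job.
    raise RuntimeError(f"job failed after {max_attempts} attempts")


def flaky_double(record, attempt):
    # Simulated task failure on the first attempt only (illustrative).
    if attempt == 1 and record == 2:
        raise ValueError("simulated task failure")
    return record * 2


results, attempts = run_job([1, 2, 3], flaky_double)
```

   In this sketch the first attempt fails partway through, so its partial output is thrown away and both stages rerun from scratch, which mirrors point 3.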
   
   None of these semantics reference streaming. Real-time mode is just the 
first caller, and the capability isn't streaming-specific: any feature that 
uses an incrementally-readable shuffle can opt into the same semantics. For 
expedience, the PR gates them behind a streaming-named property 
(`streaming.concurrent.stages.enabled`), and I'm happy to rename it to 
something more generic if you'd like.
     
   I'd also note the DAGScheduler footprint is deliberately small: the 
base-class change is a no-op hook, a couple of accessors, and a few visibility 
relaxations, with the default execution path unchanged — precisely because I 
share your concern that changes there are high-risk. The new behavior is fully 
opt-in, so structuring it this way keeps the blast radius small: unrelated 
queries can't be affected by these changes. That's also what makes landing it 
incrementally low-risk.
     
   Could we use this PR as the first milestone and merge it as-is? It would let 
us test and validate real-time mode end-to-end in-tree while we design the 
deeper integration. As an immediate follow-up, I will work through how to make 
these semantics more natively defined in the DAGScheduler.
   
   What do you think?


