jerrypeng commented on PR #56055: URL: https://github.com/apache/spark/pull/56055#issuecomment-4599583040
@mridulm — This change is not needed when the streaming query is a single stage: a single long-running (map) stage runs fine on the existing scheduler, which is exactly why RTM support for single-stage stateless queries already shipped in 4.1 — no scheduler change required there. However, multi-stage queries (e.g. stateful queries) are today executed one stage at a time, with each stage's shuffle output fully materialized before the next stage starts. To reach millisecond-level latencies, we instead need the stages of a single query to run concurrently, connected by the streaming shuffle (currently being merged in incrementally). Enabling concurrent execution of dependent stages within a single job is what requires the scheduler change — which is why this work is needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
