mridulm commented on PR #56055: URL: https://github.com/apache/spark/pull/56055#issuecomment-4636783224
I am not in favor of merging this PR. [This comment](https://github.com/apache/spark/pull/56055#issuecomment-4616968863) is directionally better aligned with how we should approach it.

Strawman proposal: extend DAGScheduler to support real-time shuffle as a first-class concept.

Currently we have:

* Narrow dependency between RDDs: merged into the same stage.
* Shuffle dependency: introduces a shuffle split (unless it is provable that we can convert it to a narrow dependency).

These come with semantics around how to handle failures, etc.

Extend this to support real-time shuffle as a first-class dependency type, and define:

a) Given a job, how it gets split into stages and how those stages are wired together based on real-time shuffle dependencies (when to split, when to combine within a stage).
b) Which stages can be executed concurrently and which need to wait.
c) What the semantics around failures are.
d) How this interacts with existing constructs (for example, if there is a 'regular' shuffle dependency: is it supported, or do we throw an exception?).

This PR is good for testing things out and validating ideas, but not for merging into Apache Spark itself.
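To make questions (a), (b) and (d) concrete, here is a toy Python model of the stage-splitting rule. It is only a sketch for discussion and is not Spark code: `Node`, `split_into_stages`, `launch_constraints` and `check_no_mixing` are hypothetical names, the "real-time stages run together" rule is an assumption, and throwing on mixed shuffle kinds is just one of the options raised in (d). Failure semantics (c) are not modeled, and each RDD is assigned to the first stage that reaches it, which is a simplification.

```python
from dataclasses import dataclass, field

# Dependency kinds in this toy model (REALTIME is the proposed, hypothetical one).
NARROW, SHUFFLE, REALTIME = "narrow", "shuffle", "realtime"


@dataclass
class Node:
    """Stand-in for an RDD: a name plus (parent, dependency kind) pairs."""
    name: str
    deps: list = field(default_factory=list)


def split_into_stages(final):
    """Walk lineage from the final node; narrow deps stay in the same stage,
    shuffle and real-time deps start a new parent stage.

    Returns (stage_of, edges) where edges holds (parent_stage, child_stage, kind).
    """
    stage_of, edges, counter = {}, set(), [0]

    def new_stage():
        counter[0] += 1
        return counter[0] - 1

    def visit(node, stage_id):
        if node.name in stage_of:  # simplification: first stage to reach it wins
            return
        stage_of[node.name] = stage_id
        for parent, kind in node.deps:
            if kind == NARROW:
                visit(parent, stage_id)
            else:
                if parent.name not in stage_of:
                    visit(parent, new_stage())
                edges.add((stage_of[parent.name], stage_id, kind))

    visit(final, new_stage())
    return stage_of, edges


def launch_constraints(edges):
    """Assumed rule for (b): a child waits for regular-shuffle parents,
    but is launched together with real-time-shuffle parents."""
    waits_for, runs_with = {}, {}
    for parent, child, kind in edges:
        target = waits_for if kind == SHUFFLE else runs_with
        target.setdefault(child, set()).add(parent)
    return waits_for, runs_with


def check_no_mixing(edges):
    """One possible answer to (d): refuse jobs that mix both shuffle kinds."""
    kinds = {kind for _, _, kind in edges}
    if {SHUFFLE, REALTIME} <= kinds:
        raise ValueError("job mixes regular and real-time shuffle dependencies")


# source -> mapped (narrow) -> agg (real-time shuffle) -> sink (narrow)
source = Node("source")
mapped = Node("mapped", [(source, NARROW)])
agg = Node("agg", [(mapped, REALTIME)])
sink = Node("sink", [(agg, NARROW)])

stage_of, edges = split_into_stages(sink)
check_no_mixing(edges)
waits_for, runs_with = launch_constraints(edges)
```

In this example `sink` and `agg` share one stage, `mapped` and `source` share another, and the two stages are linked by a single real-time edge, so under the assumed rule they would be scheduled concurrently rather than one after the other.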
