mridulm commented on PR #56055:
URL: https://github.com/apache/spark/pull/56055#issuecomment-4636783224

   I am not in favor of merging this PR.
   [This comment](https://github.com/apache/spark/pull/56055#issuecomment-4616968863) is directionally better aligned with how we should approach it.
   
   Strawman proposal: extend support for real-time shuffle as a first-class concept within DAGScheduler.
   
   Currently we have:
   * Narrow dependency between RDDs - merge into the same stage.
   * Shuffle dependency - introduce a shuffle split (unless it is provable that we can convert it to a narrow dependency).
   
   Both come with defined semantics around how to handle failures, etc.
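
For illustration, the current splitting rule described above can be sketched as a toy model. This is not Spark's DAGScheduler code: `RDD`, `Dep`, and `split_into_stages` are hypothetical names invented for this example, and failure handling is left out entirely. It only shows narrow dependencies being merged into one stage and a shuffle dependency starting a new parent stage.

```python
from dataclasses import dataclass, field


@dataclass
class Dep:
    parent: "RDD"
    shuffle: bool  # True = shuffle dependency, False = narrow dependency


@dataclass
class RDD:
    name: str
    deps: list = field(default_factory=list)


def split_into_stages(final_rdd):
    """Return stages (lists of RDD names), parent stages first."""
    stages = []
    visited_roots = set()

    def visit_stage(root):
        if root.name in visited_roots:
            return
        visited_roots.add(root.name)
        members, parents, stack, seen = [], [], [root], set()
        while stack:
            rdd = stack.pop()
            if rdd.name in seen:
                continue
            seen.add(rdd.name)
            members.append(rdd.name)
            for dep in rdd.deps:
                if dep.shuffle:
                    # Shuffle dependency: the parent starts a separate stage.
                    parents.append(dep.parent)
                else:
                    # Narrow dependency: the parent joins the current stage.
                    stack.append(dep.parent)
        # Parent stages must be emitted before the stage that reads from them.
        for parent in parents:
            visit_stage(parent)
        stages.append(sorted(members))

    visit_stage(final_rdd)
    return stages


# Toy lineage: source -> map (narrow) -> reduceByKey (shuffle) -> filter (narrow)
source = RDD("source")
mapped = RDD("map", [Dep(source, shuffle=False)])
reduced = RDD("reduceByKey", [Dep(mapped, shuffle=True)])
filtered = RDD("filter", [Dep(reduced, shuffle=False)])

stages = split_into_stages(filtered)
print(stages)  # [['map', 'source'], ['filter', 'reduceByKey']]
```

In this model, a real-time shuffle dependency would be a third kind of edge, and the questions below amount to deciding how `split_into_stages` and the scheduling of its output should treat that edge.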
   
   Extend this to support real-time shuffle as a first-class concept, and define:
   a) Given a job, how it gets 'split' into stages and how those stages are wired together based on real-time shuffle dependencies (when to split, when to combine within a stage).
   b) Which stages can be executed concurrently and which need to wait.
   c) What the semantics around failures are.
   d) How this interacts with existing constructs (for example, if there is a 'regular' shuffle dependency: throw an exception? Supported?).
   
   This PR is good for testing things out and validating ideas, but not for merging into Apache Spark itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

