mridulm commented on pull request #34461:
URL: https://github.com/apache/spark/pull/34461#issuecomment-964557253


   @Ngone51 For the example you gave, namely:
   stage1/attempt1 (parent) -> stage2/attempt2 (child)  -- fetch failed --> 
stage1/attempt2 -> stage2/attempt2
   
   Here, we have two cases:
   
   * If stage1 is an indeterminate stage - we throw away all merged output from 
stage1/attempt1 when we run stage1/attempt2.
     * This happens via `shuffleDep.newShuffleMergeState()`, which is invoked 
at the beginning of 
[submitMissingTasks](https://github.com/apache/spark/blob/c4e975e175c01f67ece7ae492a79554ad1b44106/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1374)
   * If stage1 is a deterministic stage - we disable shuffle push for 
stage1/attempt2, and run stage1/attempt2.
     * This happens in `submitMissingTasks` if the shuffle merge is already 
finalized, see 
[here](https://github.com/apache/spark/blob/c4e975e175c01f67ece7ae492a79554ad1b44106/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1403).
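
The two cases above can be sketched as a small state transition. This is a simplified, hypothetical model and not Spark's actual code: `MergeState`, `mergeGeneration`, `pushEnabled`, and `RetryModel.onParentRetry` are names invented for illustration, and re-enabling push after a reset is an assumption of the model.

```scala
// Hypothetical, simplified model; none of these types are Spark's real classes.
final case class MergeState(
    mergeGeneration: Int,     // bumped whenever earlier merged output is discarded
    mergeFinalized: Boolean,  // merged output from an earlier attempt is finalized
    pushEnabled: Boolean)     // whether mappers in the next attempt push blocks

object RetryModel {
  /** State to use when the parent stage is resubmitted after a fetch failure. */
  def onParentRetry(prev: MergeState, parentIsIndeterminate: Boolean): MergeState = {
    if (parentIsIndeterminate) {
      // Case 1: discard merged output from the previous attempt and start over.
      MergeState(prev.mergeGeneration + 1, mergeFinalized = false, pushEnabled = true)
    } else if (prev.mergeFinalized) {
      // Case 2: keep the finalized merged output, but do not push again.
      prev.copy(pushEnabled = false)
    } else {
      prev
    }
  }
}
```

In this model, a deterministic parent that is retried ends up with `mergeFinalized = true` and `pushEnabled = false`, which is the combination the rest of this comment is concerned with.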
   
   In all of these combinations, whether stage2/attemptN uses merged output 
must depend only on whether merged output from the parent stage exists for it 
to use.
   Currently, we use a single variable to control behavior on both the mapper 
side (push side) and the reducer side (using merged output). This was fine 
initially, but given deterministic parent stages and retried child stages, 
we should separate the two.
   
   IMO we should add a new variable, say `useMergedShuffleInput`, which will 
be based on whether merged output from the parent exists for that 
shuffle dep.
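
A minimal sketch of what that separation could look like. Only the name `useMergedShuffleInput` comes from this proposal; `ShuffleDepFlags`, `pushEnabled`, and `mergedOutputAvailable` are hypothetical and do not correspond to existing Spark fields.

```scala
// Hypothetical sketch of splitting the single flag into two concerns.
final case class ShuffleDepFlags(
    pushEnabled: Boolean,           // mapper side: should this attempt push blocks?
    mergedOutputAvailable: Boolean  // does merged output from the parent exist?
) {
  // Reducer side: depends only on whether merged output exists,
  // not on whether the current attempt pushes.
  def useMergedShuffleInput: Boolean = mergedOutputAvailable
}

object ShuffleDepFlagsExample {
  // Deterministic parent retried after its merge was finalized:
  // push is disabled, yet reducers can still read the merged output.
  val retriedDeterministicParent: ShuffleDepFlags =
    ShuffleDepFlags(pushEnabled = false, mergedOutputAvailable = true)
}
```

With a single shared flag, disabling push for the retried attempt would also stop reducers from reading merged output; with two flags, the reducer-side decision is unaffected.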
   
   Thoughts @Victsm, @zhouyejoe, @otterc, @rmcyang ?
   
   

