mridulm commented on pull request #34461:
URL: https://github.com/apache/spark/pull/34461#issuecomment-964557253
@Ngone51 For the example you gave, namely:
stage1/attempt1 (parent) -> stage2/attempt1 (child) -- fetch failed -->
stage1/attempt2 -> stage2/attempt2
Here, we have two cases:
* If stage1 is an indeterminate stage - we throw away all merged output from
stage1/attempt1 when we run stage1/attempt2.
* This happens via `shuffleDep.newShuffleMergeState()`, which is invoked
at the beginning of
[submitMissingTasks](https://github.com/apache/spark/blob/c4e975e175c01f67ece7ae492a79554ad1b44106/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1374).
* If stage1 is a determinate stage - we disable shuffle push for
stage1/attempt2, and run stage1/attempt2.
* This happens in `submitMissingTasks` if the shuffle merge is already
finalized, see
[here](https://github.com/apache/spark/blob/c4e975e175c01f67ece7ae492a79554ad1b44106/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1403).
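To make the two cases concrete, here is a minimal sketch of the retry behavior described above. The names (`ShuffleState`, `prepareRetry`, `parentIsIndeterminate`) are illustrative assumptions, not the actual `DAGScheduler` code; only `newShuffleMergeState()` and the enabled/finalized flags correspond to things mentioned in this thread:

```scala
// Hypothetical model of a shuffle dependency's merge-related state.
case class ShuffleState(
    var mergeFinalized: Boolean,       // has merged output been finalized?
    var shuffleMergeEnabled: Boolean)  // mapper side: push blocks for merging?
{
  // Mirrors shuffleDep.newShuffleMergeState(): discard the prior merged
  // output and start a fresh merge state for the new attempt.
  def newShuffleMergeState(): Unit = mergeFinalized = false
}

// Illustrative helper: what happens to the parent's shuffle dep on retry.
def prepareRetry(dep: ShuffleState, parentIsIndeterminate: Boolean): Unit =
  if (parentIsIndeterminate) {
    // Case 1: indeterminate parent - previously merged output is invalid.
    dep.newShuffleMergeState()
  } else if (dep.mergeFinalized) {
    // Case 2: determinate parent, merge already finalized - stop pushing,
    // but the existing merged output remains valid.
    dep.shuffleMergeEnabled = false
  }
```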
In all of these combinations, whether stage2/attemptN uses merged output
must be based solely on whether merged output from the parent stage exists for
it to use.
Currently, we are using a single variable to control behavior on both the
mapper side (pushing blocks) and the reducer side (reading merged output). This
was fine initially, but given the case of determinate parent stages with
retried child stages, we should make them separate.
IMO we should add a new variable, say `useMergedShuffleInput`, which would
be based on whether merged output from the parent exists for that shuffle dep.
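A sketch of the proposed split: `useMergedShuffleInput` is the name suggested above, while the surrounding type and field names are illustrative assumptions, not existing Spark API:

```scala
// Hypothetical separation of the two concerns currently driven by one flag.
final case class PushShuffleFlags(
    shuffleMergeEnabled: Boolean,      // mapper side: should tasks push blocks?
    mergedOutputAvailable: Boolean) {  // does finalized merged output exist?
  // Reducer side: using merged output depends only on whether merged output
  // exists for this shuffle dep - not on whether pushing is currently on.
  def useMergedShuffleInput: Boolean = mergedOutputAvailable
}
```

With this split, a retried child of a determinate parent (push disabled for the retry) can still consume the merged output produced during the earlier attempt.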
Thoughts @Victsm, @zhouyejoe, @otterc, @rmcyang ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]