mridulm commented on pull request #33872:
URL: https://github.com/apache/spark/pull/33872#issuecomment-915612067
@Ngone51 Before this fix, we primarily have the following three cases:
1. Final ResultStage
   - These won't have a task re-execution, as the result has already been fetched (collect/save/etc).
2. ShuffleMapStage with ESS enabled.
   - If only the executor has failed and the node itself has not gone down, then we won't have re-execution.
3. ShuffleMapStage without ESS enabled.
   - This will have re-execution, as the shuffle blocks are lost (see the sketch after this list).
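Not part of the discussion above, just a minimal Scala sketch of the configuration that separates cases 2 and 3: `spark.shuffle.service.enabled` decides whether map output survives an executor-only failure. The `local-cluster` master, the object name, and the toy job are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleLossSketch {
  def main(args: Array[String]): Unit = {
    // With the external shuffle service disabled (case 3), map output files live only
    // with the executor that wrote them; losing that executor loses the shuffle blocks,
    // so a later FetchFailed makes the DAGScheduler resubmit the ShuffleMapStage.
    // With the service enabled (case 2), the node-local service can keep serving the
    // blocks as long as the host itself is still up.
    val spark = SparkSession.builder()
      .appName("shuffle-loss-sketch")
      .master("local-cluster[2,1,1024]")                  // illustrative 2-executor local cluster
      .config("spark.shuffle.service.enabled", "false")   // "true" for case 2; needs a running ESS on the node
      .getOrCreate()
    val sc = spark.sparkContext

    // A job with one shuffle boundary: reduceByKey creates a ShuffleMapStage,
    // and the collect() runs in the final ResultStage (case 1).
    val counts = sc.parallelize(1 to 1000, 8)
      .map(i => (i % 10, 1))
      .reduceByKey(_ + _)
      .collect()

    println(counts.mkString(", "))
    spark.stop()
  }
}
```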
In addition to the above, if the stage was persisting data, that would also trigger re-execution when there is a failure to fetch the RDD blocks (assuming no replication, etc.); a sketch of this follows below. Given this, we won't necessarily have stage re-execution in all cases in existing master - only when data that has actually been lost is needed.
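For the persisted-data case, a hedged sketch (the object and method names are hypothetical, and it assumes an existing SparkContext such as the `sc` from the sketch above): caching without replication means the cached partitions of a lost executor have to be recomputed from lineage.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object CachedRddSketch {
  // Assumes an existing SparkContext, e.g. the `sc` from the sketch above.
  def cacheWithoutReplication(sc: SparkContext): Unit = {
    val cached = sc.parallelize(1 to 1000, 8)
      .map(i => i * i)                      // stand-in for an expensive transform
      .persist(StorageLevel.MEMORY_ONLY)    // single copy, no MEMORY_ONLY_2 replication

    cached.count()                          // materializes the cached blocks on the executors
    // If an executor is lost at this point, its cached partitions are gone; the next
    // action recomputes just those partitions from lineage instead of reading them
    // from the block manager.
    cached.map(_ * 2).count()
  }
}
```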
On the other hand, the task events would be inconsistent whenever the
problem is hit.
After the change, we will re-execute the task for all of the cases above if
this issue is hit (executor failure racing against successful result fetch).
I am actually fine with this change in behavior given the consistent way we
handle events.
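The exact race (an executor failure arriving around a successful result fetch) cannot be forced from user code, so this last sketch only provokes an executor loss while a job is still running, which lets the resulting task and stage events be compared before and after the fix in the UI or event log. The executor id, delay, and job are illustrative assumptions; `SparkContext.killExecutor` is a `@DeveloperApi` call that needs a coarse-grained backend (so not plain `local[*]`).

```scala
import org.apache.spark.sql.SparkSession

object ExecutorLossSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("executor-loss-sketch")
      .master("local-cluster[2,1,1024]")   // illustrative; killExecutor needs a coarse-grained backend
      .getOrCreate()
    val sc = spark.sparkContext

    // Ask the cluster manager to kill one executor shortly after the job starts.
    new Thread(() => {
      Thread.sleep(3000)
      sc.killExecutor("1")                 // @DeveloperApi; standalone executor ids start at "0"
    }).start()

    // Slow shuffle job so the kill lands while tasks are still in flight.
    val result = sc.parallelize(1 to 100, 10)
      .map { i => Thread.sleep(200); (i % 5, i) }
      .reduceByKey(_ + _)
      .collect()

    println(result.mkString(", "))
    spark.stop()
  }
}
```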