mridulm commented on pull request #33872:
URL: https://github.com/apache/spark/pull/33872#issuecomment-915612067


   @Ngone51 Before this fix, we primarily have the following three cases:
   
   1. Final ResultStage
       - These won't have a task re-execution, as the result has already been fetched (collect/save/etc.).
   2. ShuffleMapStage with ESS enabled.
       - If only the executor has failed and the node itself has not gone down, then we won't have re-execution.
   3. ShuffleMapStage without ESS enabled.
       - This will have re-execution, as the shuffle blocks are lost.
   
   In addition to the above, if the stage was persisting data, a failure to fetch the RDD blocks would also trigger re-execution (assuming no replication, etc.). Given this, in existing master we won't necessarily have stage re-execution in all cases - only when there is a need for data that is actually lost.
   On the other hand, the task events would be inconsistent whenever the problem is hit.
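   
   To make the pre-fix behavior concrete, here is a minimal, self-contained sketch of the decision implied by the cases above. This is illustrative only - the names (`needsReexecution`, `hostLost`, `cachedBlocksLost`) and the simplified stage types are assumptions for the example, not the actual DAGScheduler code or API:
   
   ```scala
   object ReexecutionSketch {
   
     // Simplified stage types for this illustration only.
     sealed trait Stage
     case object FinalResultStage extends Stage
     final case class ShuffleMapStage(essEnabled: Boolean) extends Stage
   
     /** Whether losing an executor forces re-running the stage's tasks (pre-fix reading). */
     def needsReexecution(stage: Stage, hostLost: Boolean, cachedBlocksLost: Boolean): Boolean =
       stage match {
         case FinalResultStage =>
           // Case 1: the result was already fetched (collect/save/...); only a lost
           // cached RDD block that is still needed would force recomputation.
           cachedBlocksLost
         case ShuffleMapStage(essEnabled) if essEnabled =>
           // Case 2: with ESS the shuffle output is served from the host, so losing
           // only the executor does not lose it; losing the host (or needed cached blocks) does.
           hostLost || cachedBlocksLost
         case ShuffleMapStage(_) =>
           // Case 3: without ESS the shuffle output lived in the lost executor.
           true
       }
   
     def main(args: Array[String]): Unit = {
       // Executor lost, host still up, no cached blocks lost:
       println(needsReexecution(FinalResultStage, hostLost = false, cachedBlocksLost = false))                     // false
       println(needsReexecution(ShuffleMapStage(essEnabled = true), hostLost = false, cachedBlocksLost = false))   // false
       println(needsReexecution(ShuffleMapStage(essEnabled = false), hostLost = false, cachedBlocksLost = false))  // true
     }
   }
   ```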
   
   After the change, we will re-execute the task in all of the cases above if this issue is hit (executor failure racing against a successful result fetch).
   I am actually fine with this change in behavior, given the consistent way we handle events.

