weixiuli commented on pull request #30716:
URL: https://github.com/apache/spark/pull/30716#issuecomment-750233059


   
   
   
   > After taking a look at the updated UT, I think the problem mentioned by
   > @weixiuli is:
   > 
   > 1. Task X of Stage B encounters a FetchFailure while fetching the shuffle
   >    data of Stage A at executor-0; Stage A unregisters the MapOutput for
   >    mapIndex-0.
   > 2. Stage A and Stage B are both marked as failed, and Stage A starts to
   >    rerun after a while.
   > 3. Task Y (also running at executor-0) of the rerun Stage A succeeds and
   >    registers its MapOutput for the same mapIndex-0.
   > 4. Task Z of Stage B encounters a FetchFailure at executor-0 too, and
   >    Stage A unregisters the MapOutput of mapIndex-1 again.
   > 
   > I think what @weixiuli is trying to do here is to avoid unregistering the
   > MapOutput again in step 4.
   > 
   > However, I wonder whether it could happen in a real job. Or rather, it is
   > really a corner case. I think tasks like task Z usually fail early, before
   > the rerun task Y succeeds.
   
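   This is a potential problem in real jobs, especially when the network is poor
   and resources are scarce, since a FetchFailure from a task like task Z may then
   be reported only after the rerun task Y has already succeeded.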
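
The four steps quoted above can be sketched as a toy model. This is only an illustration under stated assumptions, not Spark's actual code and not necessarily what this PR changes: `ToyMapOutputTracker`, `unregister_if_stale`, and the `attempt` counter are hypothetical names, and the sketch assumes the late failure in step 4 refers to the same map index that was re-registered in step 3.

```python
class ToyMapOutputTracker:
    """Hypothetical stand-in for a map output registry (not Spark's API)."""

    def __init__(self):
        # map_index -> (executor_id, stage attempt that produced the output)
        self.outputs = {}

    def register(self, map_index, executor_id, attempt):
        self.outputs[map_index] = (executor_id, attempt)

    def unregister(self, map_index, executor_id):
        """Naive removal: drop the output if it lives on the given executor."""
        loc = self.outputs.get(map_index)
        if loc is not None and loc[0] == executor_id:
            del self.outputs[map_index]
            return True
        return False

    def unregister_if_stale(self, map_index, executor_id, failed_attempt):
        """Assumed guard: drop the output only if it was produced by the
        attempt the failing task read from (or an older one)."""
        loc = self.outputs.get(map_index)
        if loc is not None and loc[0] == executor_id and loc[1] <= failed_attempt:
            del self.outputs[map_index]
            return True
        return False


def run_scenario(guarded):
    """Replay steps 1-4; return True if the rerun output survives."""
    tracker = ToyMapOutputTracker()
    tracker.register(0, "executor-0", attempt=0)  # Stage A, first attempt

    # Step 1: task X hits FetchFailure; the output is unregistered.
    tracker.unregister(0, "executor-0")

    # Steps 2-3: Stage A reruns; task Y registers a fresh output.
    tracker.register(0, "executor-0", attempt=1)

    # Step 4: task Z, which read from attempt 0, reports FetchFailure late.
    if guarded:
        tracker.unregister_if_stale(0, "executor-0", failed_attempt=0)
    else:
        tracker.unregister(0, "executor-0")

    return 0 in tracker.outputs
```

In this model, `run_scenario(False)` loses the output that task Y just registered, while `run_scenario(True)` keeps it, which is the behavior the discussion is about avoiding or preserving.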


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


