liupc commented on issue #27604: [SPARK-30849][CORE][SHUFFLE]Fix application 
failed due to failed to get MapStatuses broadcast block
URL: https://github.com/apache/spark/pull/27604#issuecomment-596449418
 
 
   > Ok, I get your point now. Let me paraphrase it to see if I understand 
correctly:
   > 
   > Assume stage0 has finished while stage1 and stage2, which both depend on 
stage0, are running concurrently.
   > 
   > A task from stage1 hits `FetchFailedException`, which causes stage0 to 
re-run. Meanwhile, task X in stage2 is still running. Since multiple 
re-submitted stage0 tasks run at the same time, and each stage0 task that 
finishes invalidates the cached map statuses (destroying the broadcast), task X 
is very likely to hit an IOException (a.k.a. `Failed to get broadcast`) when 
fetching the broadcast map statuses from the driver, because stage0 tasks keep 
destroying the broadcast concurrently.
   > 
   > Also, on the `TaskSetManager` side, the exception is treated as a counted 
task failure (rather than a FetchFailed), so the task is retried and hits the 
same exception again and again.
   
   That's it! Thanks for reviewing.
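
The race above can be sketched with a small, hypothetical model. The class and method names below are illustrative, not Spark's actual MapOutputTracker or Broadcast API; the sketch only shows how a reader that races with repeated destroy-and-rebuild of a cached broadcast can observe it missing:

```python
import threading

# Hypothetical model of the race (names are illustrative, not Spark's API).
# Re-running stage0 tasks repeatedly destroy the cached broadcast of map
# statuses, so a still-running stage2 task that fetches it can find it gone.

class BroadcastDestroyedError(Exception):
    """Stands in for the IOException 'Failed to get broadcast'."""

class MapStatusCache:
    """Driver-side cache of broadcast map statuses (simplified)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._broadcast = {"stage0": "map statuses v1"}

    def invalidate(self):
        # Called each time a re-submitted stage0 task finishes:
        # the cached broadcast is destroyed.
        with self._lock:
            self._broadcast = None

    def rebuild(self, version):
        # A new broadcast is created for subsequent readers.
        with self._lock:
            self._broadcast = {"stage0": f"map statuses v{version}"}

    def fetch(self):
        # What task X does: read the broadcast map statuses. If a stage0
        # task destroyed the broadcast in between, the read fails.
        with self._lock:
            snapshot = self._broadcast
        if snapshot is None:
            raise BroadcastDestroyedError("Failed to get broadcast")
        return snapshot

# One unlucky interleaving, played out deterministically:
cache = MapStatusCache()
cache.fetch()        # task X reads fine before stage0 is re-run
cache.invalidate()   # a re-run stage0 task finishes, destroying the broadcast
try:
    cache.fetch()    # task X now fails with "Failed to get broadcast"
except BroadcastDestroyedError as e:
    print(e)
```

Under the fix discussed in the PR, the key point is that this failure should be treated like a stale-map-output condition rather than a plain counted task failure, so the retry loop shown in the quote does not repeat indefinitely.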

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]