weixiuli commented on pull request #30716: URL: https://github.com/apache/spark/pull/30716#issuecomment-750233059
> After taking a look at the updated UT, I think the problem mentioned by @weixiuli is:
>
> 1. Task X of Stage B encounters a FetchFailure while fetching the shuffle data of Stage A at executor-0, so Stage A unregisters the MapOutput for mapIndex-0.
> 2. Stage A and Stage B are both marked as failed, and Stage A starts to rerun after a while.
> 3. Task Y (also running at executor-0) of the rerun Stage A succeeds and registers its MapOutput for the same mapIndex-0.
> 4. Task Z of Stage B encounters a FetchFailure at executor-0 too, and Stage A unregisters the MapOutput of mapIndex-1 again.
>
> I think what @weixiuli is trying to do here is to avoid unregistering the MapOutput again in step 4.
>
> However, I wonder whether this could happen in a real job, or whether it is really just a corner case. I think tasks like task Z usually fail early, before the rerun task Y succeeds.

This is a potential problem, especially when the network is poor and resources are scarce.
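For illustration only, here is a minimal, self-contained Scala sketch of the race in the quoted steps 1-4. `ToyMapOutputRegistry`, `ToyMapStatus`, and the attempt-based guard are hypothetical simplifications assumed for this example; they are not Spark's actual `MapOutputTrackerMaster`/`DAGScheduler` code, nor necessarily the exact condition used by this PR:

```scala
import scala.collection.mutable

// Hypothetical toy model of a map-output registry, only to illustrate the race
// described above; names and behavior are simplified assumptions.
final case class ToyMapStatus(execId: String, stageAttempt: Int)

class ToyMapOutputRegistry {
  private val outputs = mutable.Map.empty[Int, ToyMapStatus] // mapIndex -> status

  def register(mapIndex: Int, status: ToyMapStatus): Unit =
    outputs(mapIndex) = status

  // Naive behavior: any FetchFailure on an executor drops the output,
  // even if it was re-registered by a newer stage attempt (step 4).
  def unregisterNaive(mapIndex: Int, execId: String): Unit =
    outputs.get(mapIndex).foreach { s =>
      if (s.execId == execId) outputs.remove(mapIndex)
    }

  // Guarded behavior (assumed, in the spirit of the discussion): ignore a stale
  // FetchFailure that came from an older stage attempt than the registered output.
  def unregisterGuarded(mapIndex: Int, execId: String, failedAttempt: Int): Unit =
    outputs.get(mapIndex).foreach { s =>
      if (s.execId == execId && s.stageAttempt <= failedAttempt) outputs.remove(mapIndex)
    }

  def isRegistered(mapIndex: Int): Boolean = outputs.contains(mapIndex)
}

object FetchFailureRaceDemo extends App {
  val registry = new ToyMapOutputRegistry

  registry.register(0, ToyMapStatus("executor-0", stageAttempt = 0)) // Stage A, attempt 0
  registry.unregisterNaive(0, "executor-0")                          // step 1: task X hits FetchFailure
  registry.register(0, ToyMapStatus("executor-0", stageAttempt = 1)) // step 3: rerun task Y succeeds

  // Step 4: task Z reports a FetchFailure that originated from attempt 0.
  // Without a guard, the freshly registered output is dropped again:
  registry.unregisterNaive(0, "executor-0")
  println(s"unguarded: mapIndex-0 registered = ${registry.isRegistered(0)}") // false

  // With the attempt-aware guard, the stale failure is ignored:
  registry.register(0, ToyMapStatus("executor-0", stageAttempt = 1))
  registry.unregisterGuarded(0, "executor-0", failedAttempt = 0)
  println(s"guarded:   mapIndex-0 registered = ${registry.isRegistered(0)}") // true
}
```

The guard above only shows one possible way a stale FetchFailure (from an older stage attempt) could be told apart from a fresh one; the actual check applied in the PR may differ.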
