s0nskar commented on PR #2373:
URL: https://github.com/apache/celeborn/pull/2373#issuecomment-2037365732

Also, I think this issue is not limited to ResultStage; it can happen with ShuffleMapStage as well in some complex cases. Consider another scenario:

`ShuffleMapStage1 -----> ShuffleMapStage2 -----> `

- Similar to the above example, let's say a skewed partition P0 is generated by `ShuffleMapStage1`.
- ShuffleMapStage2 gets a FetchFailure while reading sub-partitions of ShuffleMapStage1.
- ShuffleMapStage1 will be recomputed and its shuffle outputs will be cleared.
- Only the missing tasks of ShuffleMapStage2 will be retried, again causing the same issue.

In this case, though, we can roll back the whole lineage up to this point instead of failing the job, similar to what vanilla Spark does, but this will be very expensive.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
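
To make the ShuffleMapStage scenario from the comment concrete, here is a toy simulation in plain Python. It does not use Spark or Celeborn. It assumes that "the same issue" means the recomputed `ShuffleMapStage1` output may be split into sub-partitions differently from the first attempt. The record values, the two split layouts, and the names `split_p0`, `run_downstream`, and `summarize` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical records that ShuffleMapStage1 writes into skewed partition P0.
RECORDS = list(range(8))


def split_p0(attempt):
    """Return the two sub-partitions of P0 for a given ShuffleMapStage1 attempt.

    Assumption: the split layout is not stable across attempts, so the
    recomputed output (attempt 1) groups records differently from attempt 0.
    """
    if attempt == 0:
        return [RECORDS[:4], RECORDS[4:]]
    return [RECORDS[0::2], RECORDS[1::2]]


def run_downstream(attempt_seen_by_task):
    """Count how often each record is consumed by the ShuffleMapStage2 tasks.

    Task i reads sub-partition i from the upstream attempt it observed.
    """
    consumed = []
    for task_id, attempt in enumerate(attempt_seen_by_task):
        consumed.extend(split_p0(attempt)[task_id])
    return Counter(consumed)


def summarize(counts):
    """Return (duplicated records, lost records) for one downstream run."""
    duplicated = sorted(r for r in RECORDS if counts[r] > 1)
    lost = sorted(r for r in RECORDS if counts[r] == 0)
    return duplicated, lost


# Task 0 finished against attempt 0. Task 1 hit a FetchFailure and is
# retried alone, against the recomputed attempt 1.
partial_retry = summarize(run_downstream([0, 1]))

# Rolling back the stage: both tasks rerun against attempt 1.
full_rollback = summarize(run_downstream([1, 1]))

print("retry only the missing task:", partial_retry)
print("roll back the whole stage:  ", full_rollback)
```

Under these assumptions, retrying only the failed task mixes two split layouts, so some records are read twice and others never. Rerunning every task against the recomputed output is consistent, at the cost of redoing work that had already completed.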
