s0nskar commented on PR #2373:
URL: https://github.com/apache/celeborn/pull/2373#issuecomment-2037365732

   Also, I think this issue is not limited to ResultStage; it can also 
happen with ShuffleMapStage in some complex cases. Consider another 
scenario:
   
   `ShuffleMapStage1 -----> ShuffleMapStage2 -----> `
   
   - Similar to the above example, let's say a skewed partition P0 is 
generated by `ShuffleMapStage1`.
   - ShuffleMapStage2 gets a FetchFailure while reading sub-partitions of 
ShuffleMapStage1.
   - ShuffleMapStage1 will be recomputed and its shuffle outputs will be cleared.
   - Only the missing tasks of ShuffleMapStage2 will be retried, again causing 
the same issue.
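
   The sequence above can be sketched as a toy model. This is only an illustration, not Spark or Celeborn code: `run_scenario`, `read_from` and `stage1_attempt` are hypothetical names, and the model only records which attempt of ShuffleMapStage1 each ShuffleMapStage2 task read from.

```python
# Toy model only: these names are hypothetical, not Spark or Celeborn APIs.

def run_scenario(num_stage2_tasks=3, failing_task=2):
    stage1_attempt = 0   # attempt of ShuffleMapStage1 whose output is live
    read_from = {}       # ShuffleMapStage2 task id -> stage1 attempt it read

    # First attempt of ShuffleMapStage2: one task hits a FetchFailure
    # while reading a sub-partition of P0 and produces no output.
    for task in range(num_stage2_tasks):
        if task == failing_task:
            continue
        read_from[task] = stage1_attempt

    # ShuffleMapStage1 is recomputed; its previous shuffle output is cleared.
    stage1_attempt += 1

    # Only the missing ShuffleMapStage2 tasks are retried.
    missing = [t for t in range(num_stage2_tasks) if t not in read_from]
    for task in missing:
        read_from[task] = stage1_attempt

    return read_from


result = run_scenario()
print(result)  # {0: 0, 1: 0, 2: 1}
```

   Here tasks 0 and 1 keep output derived from attempt 0 of ShuffleMapStage1, while task 2 reads attempt 1. If the sub-partition split of P0 differs between the two attempts, the combined output of ShuffleMapStage2 mixes both.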
   
   In this case, though, we can roll back the whole lineage up to this point 
instead of failing the job, similar to what vanilla Spark does, but this will 
be very expensive.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.