CodingCat commented on PR #3569:
URL: https://github.com/apache/celeborn/pull/3569#issuecomment-3699895075

   @mridulm I would like to better understand why you think Spark's fault 
tolerance mechanism is "misused" here. Specifically, how is early deletion 
different from the scenario where shuffle data is suddenly lost when we need to 
retry or reuse it, or where a bug in Celeborn or MapOutputTracker mistakenly 
erases the shuffle metadata?
   
   
   The reason we converge the handling logic on FetchFailedException is 
precisely to maximize compatibility with any future changes in Spark: as long 
as Spark relies on this exception to recompute lost data, this PR will remain 
compatible. If Spark changes this part, then not only this PR but a large chunk 
of the fault tolerance handling in any shuffle system would need to change.
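
   To make the contract concrete, here is a minimal toy model (not Spark's or 
Celeborn's actual code; `ShuffleStore`, `runReducer`, and `recomputeMap` are 
hypothetical names) of the recovery path the PR relies on: from the scheduler's 
point of view, a fetch failure looks the same whether the data was deleted 
early, lost with a node, or erased by a metadata bug, and the response is 
always the same, rerun the map task and retry.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of Spark's FetchFailedException-driven recovery loop.
// All names here are illustrative, not Spark/Celeborn APIs.
public class FetchFailureModel {

    // Stand-in for org.apache.spark.shuffle.FetchFailedException.
    static class FetchFailedException extends RuntimeException {
        final int mapId;
        FetchFailedException(int mapId) {
            super("missing map output " + mapId);
            this.mapId = mapId;
        }
    }

    // Hypothetical shuffle store: map output can vanish at any time.
    // Early deletion, node loss, and erased tracker metadata all look
    // identical to the reducer: the fetch simply fails.
    static class ShuffleStore {
        private final Map<Integer, String> outputs = new HashMap<>();
        void write(int mapId, String data) { outputs.put(mapId, data); }
        void delete(int mapId) { outputs.remove(mapId); }
        String fetch(int mapId) {
            String d = outputs.get(mapId);
            if (d == null) throw new FetchFailedException(mapId);
            return d;
        }
    }

    // Stand-in for rerunning the lost map task.
    static String recomputeMap(int mapId) { return "map-" + mapId + "-output"; }

    // Scheduler-style loop: any fetch failure triggers recomputation of the
    // lost map output, regardless of *why* it disappeared.
    static String runReducer(ShuffleStore store, int mapId) {
        while (true) {
            try {
                return store.fetch(mapId);
            } catch (FetchFailedException e) {
                store.write(e.mapId, recomputeMap(e.mapId));
            }
        }
    }

    public static void main(String[] args) {
        ShuffleStore store = new ShuffleStore();
        store.write(0, recomputeMap(0));
        store.delete(0);                          // early deletion / sudden loss
        System.out.println(runReducer(store, 0)); // recovers via recomputation
    }
}
```

   As long as Spark keeps this "fetch failure means recompute the parent map 
stage" contract, the early-deletion case handled in this PR flows through the 
same path as every other loss scenario.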
   
   I would love to hear more details about which changes you have in mind that 
would make this PR incompatible with upstream Spark, rather than statements 
like "this is not right" or "this is misused".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
