mridulm commented on PR #2373:
URL: https://github.com/apache/celeborn/pull/2373#issuecomment-2080175337

   > Totally Agree! Correctness is NO.1 important thing. Based on previous 
discussion, correctness will only happen when stage rerun, I think we both 
agree that if this PR takes affects, we will not allow stage rerun, right?
   
   As [currently 
formulated](https://github.com/apache/celeborn/pull/2373/files#diff-c8ece928ffe63a93526fd731b5d092d659255064e6880c428ae25d252324069eR68),
 we eagerly fail based on config: you are right, this does prevent data 
corruption - but requires disabling `throwsFetchFailure` in order to leverage 
the benefits in this PR (and disabling `throwsFetchFailure` which has its own 
implications, as discussed in CELEBORN-955 for Spark).
   
   Which is why I am trying to see if we can keep both benefits :-) 
`throwsFetchFailure` to allow for resilience in case of shuffle fetch failures, 
as well as mitigate the performance/timeout issues due to sorting.
   
   We dont necessarily need to explore [this 
proposal](https://github.com/apache/celeborn/pull/2373#issuecomment-2076295830) 
in current PR btw ! The changes here are flag guarded anyway, so can be 
independently explored in future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to