mridulm commented on PR #2373: URL: https://github.com/apache/celeborn/pull/2373#issuecomment-2080175337
> Totally Agree! Correctness is NO.1 important thing. Based on previous discussion, correctness will only happen when stage rerun, I think we both agree that if this PR takes affects, we will not allow stage rerun, right? As [currently formulated](https://github.com/apache/celeborn/pull/2373/files#diff-c8ece928ffe63a93526fd731b5d092d659255064e6880c428ae25d252324069eR68), we eagerly fail based on config: you are right, this does prevent data corruption - but requires disabling `throwsFetchFailure` in order to leverage the benefits in this PR (and disabling `throwsFetchFailure` which has its own implications, as discussed in CELEBORN-955 for Spark). Which is why I am trying to see if we can keep both benefits :-) `throwsFetchFailure` to allow for resilience in case of shuffle fetch failures, as well as mitigate the performance/timeout issues due to sorting. We dont necessarily need to explore [this proposal](https://github.com/apache/celeborn/pull/2373#issuecomment-2076295830) in current PR btw ! The changes here are flag guarded anyway, so can be independently explored in future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
