Re: [PR] [CELEBORN-1896] delete data from failed to fetch shuffles [celeborn]

via GitHub Wed, 30 Apr 2025 21:46:27 -0700


CodingCat commented on PR #3109:
URL: https://github.com/apache/celeborn/pull/3109#issuecomment-2844097622


   > Given this applies to indeterminate stages (and not determinate stages),
   > in the scenario presented where stage 0 is indeterminate - we must recompute
   > stage 2 when stage 1 suffers from a fetch failure : else the data fetched for
   > stage 1, when stage 0.1 runs - will no longer be consistent with what was
   > already fetched for stage 2.
   >
   > A variant of this is discussed here:
   > [apache/spark#50630](https://github.com/apache/spark/pull/50630) (unfortunately
   > a long discussion).
   >
   > Spark 4.1 will end up enforcing this for its own shuffle - for Apache
   > Celeborn, we will need to do something similar (I have not brought this up, as
   > the PR is not yet merged to Apache Spark :) )
   >
   > Do let me know if I am misunderstanding the scenario !
   
   I think you are right, great insight!
   
   I just added code to handle the indeterminate case as well, along with a test.
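
   To make the rule from the quoted comment concrete, here is a rough sketch. It is not the code in this PR and does not use any Spark or Celeborn API. It assumes a hypothetical DAG in which stage 0 feeds both stage 1 and stage 2; the `children` map and the function name `stages_to_recompute` are made up for illustration. The idea: when an indeterminate stage is rerun, everything downstream of it has consumed output that may no longer match, so all of it has to be recomputed.

```python
from collections import deque


def stages_to_recompute(children, rerun_stage, indeterminate):
    """Return the downstream stages that must be recomputed when
    `rerun_stage` is rerun after a fetch failure.

    children      -- hypothetical map: stage id -> stages that read its output
    rerun_stage   -- the stage being re-executed (e.g. stage 0 as attempt 0.1)
    indeterminate -- set of stage ids whose output may differ across attempts
    """
    # A determinate stage reproduces the same output, so data already
    # fetched by downstream stages stays consistent.
    if rerun_stage not in indeterminate:
        return set()

    # Otherwise every transitive consumer of the rerun stage is stale.
    stale = set()
    queue = deque(children.get(rerun_stage, ()))
    while queue:
        stage = queue.popleft()
        if stage in stale:
            continue
        stale.add(stage)
        queue.extend(children.get(stage, ()))
    return stale


# Assumed scenario: stage 0 is indeterminate and feeds stages 1 and 2.
children = {0: [1, 2]}
print(sorted(stages_to_recompute(children, rerun_stage=0, indeterminate={0})))
```

   With stage 0 marked indeterminate, both stage 1 and stage 2 come back as stale, which is the case raised in the quoted comment. If stage 0 is treated as determinate, the result is empty and only the missing shuffle data needs to be regenerated.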


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

