Re: [PR] [CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported [celeborn]

via GitHub Wed, 12 Feb 2025 09:31:42 -0800


buska88 commented on PR #3090:
URL: https://github.com/apache/celeborn/pull/3090#issuecomment-2653606341


   > > According to your Jira ticket, "that shuffle fetch fails does not lead
   > > to stage fail because task speculation and another attempts succeed", I think
   > > the quoted scenario should not happen if you have PR #3080 and #2921.
   > > Do you have these two PRs for your Celeborn client?
   > 
   > No. We mainly use branch-0.5. Would you please consider merging these
   > two PRs into branch-0.5-rc? They are important features, and many users on
   > branch-0.5 may need them.
   
   As for this PR, it occurs to me that when a task of a stage throws a
fetchFail exception, there is a short window between the exception being
reported and the stage being aborted. During that window, subsequent read tasks
can still get the appId and finish their shuffle reads. When the stage is
retried, those tasks may not need to be recomputed, which saves resources. If
the subsequent tasks instead fail because they cannot get the appId, they will
inevitably be recomputed in the next stage retry.
   I think this PR has only a small effect once the two PRs you mentioned are
in place, so it seems unnecessary.
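
   To make the window described above concrete, here is a toy Java sketch. It
is not Celeborn or Spark code: the class, the method `tasksToRecompute`, the
integer timestamps, and the `returnIdAfterFetchFail` flag are all hypothetical
names invented for illustration. It also assumes, as a simplification, that any
task that obtains the id finishes its shuffle read successfully.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; none of these names are Celeborn or Spark APIs. */
public class FetchFailWindow {

    /**
     * Returns the indices of read tasks that would have to run again in the stage retry.
     *
     * @param requestTimes           time at which each read task asks for the id
     * @param fetchFailAt            time at which the fetch failure is reported
     * @param abortAt                time at which the stage is aborted
     * @param returnIdAfterFetchFail whether the id is still handed out inside the window
     */
    public static List<Integer> tasksToRecompute(
            int[] requestTimes, int fetchFailAt, int abortAt, boolean returnIdAfterFetchFail) {
        List<Integer> recompute = new ArrayList<>();
        for (int task = 0; task < requestTimes.length; task++) {
            int t = requestTimes[task];
            boolean beforeFailure = t < fetchFailAt;
            boolean inWindow = t >= fetchFailAt && t < abortAt;
            // Simplifying assumption: a task that gets the id completes its read.
            boolean finished = beforeFailure || (inWindow && returnIdAfterFetchFail);
            if (!finished) {
                recompute.add(task);
            }
        }
        return recompute;
    }

    public static void main(String[] args) {
        int[] requestTimes = {1, 5, 6, 9};
        // Fetch failure reported at t=4, stage aborted at t=8.
        System.out.println(tasksToRecompute(requestTimes, 4, 8, true));
        System.out.println(tasksToRecompute(requestTimes, 4, 8, false));
    }
}
```

   In this model, the two tasks that ask for the id inside the window (at t=5
and t=6) are the only ones affected by the flag, which is why the benefit is
limited to whatever happens to be scheduled in that short interval.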


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
