buska88 commented on PR #3090: URL: https://github.com/apache/celeborn/pull/3090#issuecomment-2653606341
> > According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted scenario should not happen if you have PR #3080 and #2921.
> > Do you have these two PRs for your Celeborn client?
> > No. We mainly use branch-0.5. Would you please consider merging these two PRs into branch-0.5-rc? Those are important features, and many users on branch-0.5 may need them. As for this PR, it occurs to me that when a task in a stage throws a fetch-failure exception, there is a short window between the exception being reported and the stage being aborted, during which subsequent read tasks can still get the appId and finish shuffle reading. Then, when the stage is retried, these tasks may not need to be recomputed, which saves resources. If the subsequent tasks fail because they cannot get the appId, they will inevitably be recomputed in the next stage retry. I think this PR has only a tiny effect once we have the two PRs you mentioned, so it seems unnecessary.
