RexXiong commented on PR #3531: URL: https://github.com/apache/celeborn/pull/3531#issuecomment-3532885543
I don't think this change is quite right. For example, suppose attempts 0 and 1 have already failed, attempt 2 is running, and attempt 3 reports a failure. With the modified logic this returns true, but it should return false because attempt 2 is still running. Instead, we should check how many attempts have already failed. If the number of failed attempts has reached maxTaskFails, it should return true. In the scenario you provided, when attempt 3 reports, attempt 4 is running, but attempts 0, 1, and 2 have already failed; that is three failures, and this report makes four. At that point we shouldn't ignore the failure; we should determine that maxTaskFails has been reached and report a fetch failure. @leixm @turboFei
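
To make the counting rule concrete, here is a minimal sketch in Python (for brevity, not the project's actual code). The names `AttemptState`, `should_report_fetch_failure`, and `attempt_states` are hypothetical, and `max_task_fails = 4` is an assumed value chosen so that the two scenarios above come out as described.

```python
from enum import Enum


class AttemptState(Enum):
    # Hypothetical states, only the two needed for this illustration.
    RUNNING = "running"
    FAILED = "failed"


def should_report_fetch_failure(attempt_states, reporting_attempt, max_task_fails):
    """Return True once failures, including the current report, reach max_task_fails.

    attempt_states maps attempt number -> AttemptState for all known attempts.
    Running attempts are deliberately not consulted: only the failure count matters.
    """
    previously_failed = sum(
        1
        for attempt, state in attempt_states.items()
        if attempt != reporting_attempt and state is AttemptState.FAILED
    )
    # Count the failure being reported right now on top of the earlier ones.
    return previously_failed + 1 >= max_task_fails


# Scenario 1: attempts 0 and 1 failed, attempt 2 running, attempt 3 reports a failure.
scenario_1 = {
    0: AttemptState.FAILED,
    1: AttemptState.FAILED,
    2: AttemptState.RUNNING,
    3: AttemptState.RUNNING,
}

# Scenario 2: attempts 0, 1, 2 failed, attempt 3 reports a failure, attempt 4 running.
scenario_2 = {
    0: AttemptState.FAILED,
    1: AttemptState.FAILED,
    2: AttemptState.FAILED,
    3: AttemptState.RUNNING,
    4: AttemptState.RUNNING,
}

MAX_TASK_FAILS = 4  # assumed value for illustration

result_1 = should_report_fetch_failure(scenario_1, 3, MAX_TASK_FAILS)
result_2 = should_report_fetch_failure(scenario_2, 3, MAX_TASK_FAILS)
```

Under these assumptions, the first scenario sees three failures in total and does not report, while the second sees four and does, regardless of attempt 4 still running.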
