RexXiong commented on PR #3650: URL: https://github.com/apache/celeborn/pull/3650#issuecomment-4211073033
Overall, this change looks good to me. The approach of retrieving failure counts directly from `TaskSetManager.numFailures` is more accurate than manually counting failed task attempts, especially for cases like container preemption where Spark doesn't increment the failure count.

One suggestion: It would be helpful to add a test case that verifies the failure count is correctly incremented after an actual task failure (e.g., simulate a task failure and then verify that `getTaskFailureCount` returns the expected increased value). Currently, the test only validates boundary conditions (initial value, out-of-bounds indices), but doesn't cover the actual failure counting scenario.

--- by claude

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
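
For illustration, the suggested test could have roughly the following shape. This is a minimal sketch and not the PR's code: `FakeTaskSet` is a hypothetical stand-in that models only a per-task `numFailures` array, `recordFailure` and its `countsTowardsFailures` flag are invented for the example, and the signature of `getTaskFailureCount` and its return value of 0 for an out-of-bounds index are assumptions, since the comment does not state them.

```java
// Hypothetical sketch: none of these types come from Celeborn or Spark.
final class FailureCountSketch {

    /** Stand-in for a task set that tracks failure counts per task index. */
    static final class FakeTaskSet {
        private final int[] numFailures;

        FakeTaskSet(int numTasks) {
            this.numFailures = new int[numTasks];
        }

        /**
         * Records a failed attempt. Only failures that count towards the
         * limit bump the counter; a preempted container is modeled by
         * passing false (assumption made for this sketch).
         */
        void recordFailure(int taskIndex, boolean countsTowardsFailures) {
            if (countsTowardsFailures) {
                numFailures[taskIndex]++;
            }
        }

        /** Assumed contract: an out-of-bounds index yields 0. */
        int getTaskFailureCount(int taskIndex) {
            if (taskIndex < 0 || taskIndex >= numFailures.length) {
                return 0;
            }
            return numFailures[taskIndex];
        }
    }

    /** Tiny assertion helper so the sketch does not depend on a test framework. */
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        FakeTaskSet taskSet = new FakeTaskSet(2);

        // Boundary conditions the existing test is described as covering.
        check(taskSet.getTaskFailureCount(0) == 0, "initial count should be 0");
        check(taskSet.getTaskFailureCount(-1) == 0, "negative index should yield 0");
        check(taskSet.getTaskFailureCount(2) == 0, "index past the end should yield 0");

        // The missing scenario: a real failure increments the count.
        taskSet.recordFailure(0, true);
        check(taskSet.getTaskFailureCount(0) == 1, "count should be 1 after one failure");

        // A preemption-style failure leaves the count unchanged.
        taskSet.recordFailure(0, false);
        check(taskSet.getTaskFailureCount(0) == 1, "preemption should not be counted");

        // Other tasks are unaffected.
        check(taskSet.getTaskFailureCount(1) == 0, "task 1 should still be at 0");
    }
}
```

In the real suite, the stand-in would be replaced by whatever mechanism the project uses to drive an actual task failure through Spark's scheduler; the sketch only shows which assertions the added case would need.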
