Re: [PR] [CELEBORN-2230] SparkUtils#shouldReportShuffleFetchFailure method should retrieve the number of task failures from TaskSetManager [celeborn]

via GitHub Wed, 08 Apr 2026 19:36:02 -0700


RexXiong commented on PR #3650:
URL: https://github.com/apache/celeborn/pull/3650#issuecomment-4211073033


   Overall, this change looks good to me. The approach of retrieving failure 
counts directly from `TaskSetManager.numFailures` is more accurate than 
manually counting failed task attempts, especially for cases like container 
preemption where Spark doesn't increment the failure count.
   
   One suggestion: It would be helpful to add a test case that verifies the 
failure count is correctly incremented after an actual task failure (e.g., 
simulate a task failure and then verify that `getTaskFailureCount` returns the 
expected increased value). Currently, the test only validates boundary 
conditions (initial value, out-of-bounds indices), but doesn't cover the actual 
failure counting scenario.
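
The suggested test could look roughly like the sketch below. It does not use the real Spark or Celeborn classes: `FakeTaskSetManager` is a hypothetical stand-in that models only the per-task `numFailures` array, and this `getTaskFailureCount` is a hypothetical helper whose `-1` return for out-of-range indices is an assumption, not the PR's confirmed behavior. The `countTowardsTaskFailures` flag mirrors the way Spark skips the increment for failures such as container preemption.

```java
public class TaskFailureCountSketch {

    // Hypothetical stand-in for Spark's TaskSetManager. It models only the
    // per-task failure counters, not scheduling or retries.
    static final class FakeTaskSetManager {
        final int[] numFailures;

        FakeTaskSetManager(int numTasks) {
            this.numFailures = new int[numTasks];
        }

        // Simulates a failed attempt for the task at `index`. Failures that do
        // not count towards the task's failure limit (for example, container
        // preemption) leave the counter unchanged.
        void handleFailedTask(int index, boolean countTowardsTaskFailures) {
            if (countTowardsTaskFailures) {
                numFailures[index]++;
            }
        }
    }

    // Hypothetical helper: reads the counter for one task index.
    // Returning -1 for an out-of-range index is an assumption of this sketch.
    static int getTaskFailureCount(FakeTaskSetManager tsm, int index) {
        if (index < 0 || index >= tsm.numFailures.length) {
            return -1;
        }
        return tsm.numFailures[index];
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        FakeTaskSetManager tsm = new FakeTaskSetManager(2);

        // Boundary conditions, as in the existing test.
        check(getTaskFailureCount(tsm, 0) == 0, "initial count should be 0");
        check(getTaskFailureCount(tsm, 5) == -1, "out-of-range index");

        // The missing scenario: a real failure increments the count.
        tsm.handleFailedTask(0, true);
        check(getTaskFailureCount(tsm, 0) == 1, "count after one failure");

        // A preemption-style failure does not increment it.
        tsm.handleFailedTask(0, false);
        check(getTaskFailureCount(tsm, 0) == 1, "preemption must not count");

        // Other task indices are unaffected.
        check(getTaskFailureCount(tsm, 1) == 0, "task 1 should be untouched");

        System.out.println("all checks passed");
    }
}
```

A real test would drive an actual `TaskSetManager` through a task failure instead of the stand-in, then assert on the value returned by the method under review.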
   
   ---
   by claude


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
