SteNicholas commented on code in PR #3531:
URL: https://github.com/apache/celeborn/pull/3531#discussion_r2533882891


##########
client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java:
##########
@@ -362,22 +365,41 @@ public static boolean 
shouldReportShuffleFetchFailure(long taskId) {
                   taskId,
                   taskInfo.attemptNumber(),
                   ti.attemptNumber());
-              return false;
-            }
-          } else {
-            if (ti.attemptNumber() >= maxTaskFails - 1) {
-              logger.warn(
-                  "StageId={} index={} taskId={} attemptNumber {} reach 
maxTaskFails {}.",
+              hasRunningAttempt = true;
+            } else if (ti.status() == "FAILED" || ti.status() == "UNKNOWN") {
+              // For KILLED state task, Spark does not count the number of 
failures
+              // For UNKNOWN state task, Spark does count the number of 
failures
+              // For FAILED state task, Spark decides whether to count the 
failure based on the
+              // different failure reasons. In this case, since we cannot 
obtain the failure reasons,
+              // we will count all tasks in FAILED state.
+              logger.info(
+                  "StageId={} index={} taskId={} attempt={} another attempt {} 
is failed.",
                   stageId,
                   taskInfo.index(),
                   taskId,
-                  ti.attemptNumber(),
-                  maxTaskFails);
-              return true;
+                  taskInfo.attemptNumber(),
+                  ti.attemptNumber());
+              failedTaskAttempts += 1;
             }
           }
         }
-        return true;
+        // The following situations should trigger a FetchFailed exception:
+        //  1. If failedTaskAttempts >= maxTaskFails
+        //  2. If no other taskAttempts are running
+        if (failedTaskAttempts >= maxTaskFails) {

Review Comment:
   The condition of returning true could be merged to merged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to