SteNicholas commented on code in PR #3531:
URL: https://github.com/apache/celeborn/pull/3531#discussion_r2533882891
##########
client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java:
##########
@@ -362,22 +365,41 @@ public static boolean shouldReportShuffleFetchFailure(long taskId) {
               taskId,
               taskInfo.attemptNumber(),
               ti.attemptNumber());
-          return false;
-        }
-      } else {
-        if (ti.attemptNumber() >= maxTaskFails - 1) {
-          logger.warn(
-              "StageId={} index={} taskId={} attemptNumber {} reach maxTaskFails {}.",
+          hasRunningAttempt = true;
+        } else if (ti.status() == "FAILED" || ti.status() == "UNKNOWN") {
+          // For KILLED state task, Spark does not count the number of failures
+          // For UNKNOWN state task, Spark does count the number of failures
+          // For FAILED state task, Spark decides whether to count the failure based on the
+          // different failure reasons. In this case, since we cannot obtain the failure reasons,
+          // we will count all tasks in FAILED state.
+          logger.info(
+              "StageId={} index={} taskId={} attempt={} another attempt {} is failed.",
               stageId,
               taskInfo.index(),
               taskId,
-              ti.attemptNumber(),
-              maxTaskFails);
-          return true;
+              taskInfo.attemptNumber(),
+              ti.attemptNumber());
+          failedTaskAttempts += 1;
         }
       }
     }
-    return true;
+    // The following situations should trigger a FetchFailed exception:
+    // 1. If failedTaskAttempts >= maxTaskFails
+    // 2. If no other taskAttempts are running
+    if (failedTaskAttempts >= maxTaskFails) {
Review Comment:
The two conditions for returning true could be merged into a single return statement.
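
As a rough illustration of the suggestion, the two FetchFailed-triggering conditions could collapse into one boolean expression. This is only a sketch: the class and method below are hypothetical, while the names `failedTaskAttempts`, `maxTaskFails`, and `hasRunningAttempt` are taken from the diff above.

```java
// Hypothetical helper illustrating the merged condition; not actual Celeborn code.
public class FetchFailureCheck {

    // Report FetchFailed when either enough attempts have failed,
    // or no attempt is still running.
    static boolean shouldReport(int failedTaskAttempts, int maxTaskFails, boolean hasRunningAttempt) {
        return failedTaskAttempts >= maxTaskFails || !hasRunningAttempt;
    }

    public static void main(String[] args) {
        System.out.println(shouldReport(4, 4, true));  // reached maxTaskFails -> true
        System.out.println(shouldReport(1, 4, true));  // an attempt still running -> false
        System.out.println(shouldReport(1, 4, false)); // nothing running -> true
    }
}
```
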
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]