mridulm commented on a change in pull request #34834:
URL: https://github.com/apache/spark/pull/34834#discussion_r771794273



##########
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##########
@@ -960,19 +960,27 @@ private[spark] class TaskSetManager(
       assert (null != failureReason)
       taskSetExcludelistHelperOpt.foreach(_.updateExcludedForFailedTask(
         info.host, info.executorId, index, failureReason))
-      numFailures(index) += 1
-      if (numFailures(index) >= maxTaskFailures) {
-        logError("Task %d in stage %s failed %d times; aborting job".format(
-          index, taskSet.id, maxTaskFailures))
-        abort("Task %d in stage %s failed %d times, most recent failure: %s\nDriver stacktrace:"
-          .format(index, taskSet.id, maxTaskFailures, failureReason), failureException)
-        return
+      if (!successful(index)) {

Review comment:
       Agree with resetting the counter (I added more context above).

##########
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##########
@@ -820,6 +820,7 @@ private[spark] class TaskSetManager(
         s"on ${info.host} (executor ${info.executorId}) ($tasksSuccessful/$numTasks)")
       // Mark successful and stop if all the tasks have succeeded.
       successful(index) = true
+      numFailures(index) = 0

Review comment:
       Our current definition of `spark.task.maxFailures` is slightly ambiguous in this regard.
   From docs:
   > 
   >     Number of failures of any particular task before giving up on the job.
   >     The total number of failures spread across different tasks will not cause the job
   >     to fail; a particular task has to fail this number of attempts.
   >     Should be greater than or equal to 1. Number of allowed retries = this value - 1.
   > 
   
   I would argue that any successful task attempt should reset the counter, as that was the original intent; given that, this change (resetting the count to 0) should be fine.
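   The reset-on-success semantics can be sketched with a minimal standalone model of the counter. This is a hypothetical illustration (the names `numFailures` and `maxTaskFailures` mirror the `TaskSetManager` fields, but this is not the actual scheduler code):

   ```scala
   // Hypothetical sketch of per-task failure accounting with reset-on-success.
   // Simplified standalone model; not the real TaskSetManager implementation.
   object FailureCounterSketch {
     val maxTaskFailures = 4
     val numFailures =
       scala.collection.mutable.Map[Int, Int]().withDefaultValue(0)

     // Returns true if this failure should abort the task set.
     def onTaskFailed(index: Int): Boolean = {
       numFailures(index) += 1
       numFailures(index) >= maxTaskFailures
     }

     // A successful attempt wipes the failure history for that task index,
     // so only failures with no intervening success can abort the job.
     def onTaskSucceeded(index: Int): Unit = numFailures(index) = 0
   }
   ```

   With this reset, a task that fails three times, then succeeds (e.g. via another attempt), and later fails again starts counting from zero rather than aborting on its next failure.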

##########
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##########
@@ -960,19 +960,27 @@ private[spark] class TaskSetManager(
       assert (null != failureReason)
       taskSetExcludelistHelperOpt.foreach(_.updateExcludedForFailedTask(
         info.host, info.executorId, index, failureReason))
-      numFailures(index) += 1
-      if (numFailures(index) >= maxTaskFailures) {
-        logError("Task %d in stage %s failed %d times; aborting job".format(
-          index, taskSet.id, maxTaskFailures))
-        abort("Task %d in stage %s failed %d times, most recent failure: %s\nDriver stacktrace:"
-          .format(index, taskSet.id, maxTaskFailures, failureReason), failureException)
-        return
+      if (!successful(index)) {
+        numFailures(index) += 1
+        if (numFailures(index) >= maxTaskFailures && !successful(index)) {

Review comment:
       Agree with both @jiangxb1987 and @Ngone51, please remove `!successful(index)`.
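       The redundancy the reviewers point out can be seen by reducing the branch to a schematic sketch. The names `successful`, `numFailures`, and `maxTaskFailures` stand in for the `TaskSetManager` fields of the same names; this is an illustration of the control flow, not the actual patch:

   ```scala
   // Schematic sketch of the reviewed control flow (not the real patch).
   object AbortCheckSketch {
     def shouldAbort(
         index: Int,
         successful: Array[Boolean],
         numFailures: Array[Int],
         maxTaskFailures: Int): Boolean = {
       if (!successful(index)) {
         numFailures(index) += 1
         // This point is only reached when successful(index) is false, so
         // repeating `&& !successful(index)` in the condition adds nothing.
         numFailures(index) >= maxTaskFailures
       } else {
         false
       }
     }
   }
   ```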




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


