mridulm commented on code in PR #49270:
URL: https://github.com/apache/spark/pull/49270#discussion_r1897596703


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2937,7 +2938,9 @@ private[spark] class DAGScheduler(
         } else {
          // This stage is only used by the job, so finish the stage if it is running.
           val stage = stageIdToStage(stageId)
-          if (runningStages.contains(stage)) {
+          val shouldKill = runningStages.contains(stage) ||
+            (waitingStages.contains(stage) && stage.resubmitInFetchFailed)

Review Comment:
   I like @Ngone51's suggestion better - simply check for `stage.failedAttemptIds.nonEmpty || runningStages.contains(stage)`.
   I can see an argument being made for failed stages as well.
   With this, the PR will boil down to this change, plus tests to stress this logic of course.
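   A minimal standalone sketch of the suggested condition, for illustration only: the field names (`failedAttemptIds`, `runningStages`) mirror those referenced above, but the `Stage` case class and harness here are hypothetical, not the actual DAGScheduler types.

   ```scala
   // Hypothetical sketch of the suggested kill condition.
   // A non-empty failedAttemptIds means the stage has had a failed attempt
   // (e.g. it is waiting to be resubmitted after a fetch failure), so it
   // should be killed along with actively running stages.
   object KillConditionSketch {
     case class Stage(id: Int, failedAttemptIds: Set[Int])

     def shouldKill(stage: Stage, runningStages: Set[Stage]): Boolean =
       stage.failedAttemptIds.nonEmpty || runningStages.contains(stage)

     def main(args: Array[String]): Unit = {
       val running     = Stage(1, Set.empty)
       val resubmitted = Stage(2, Set(0)) // had a failed attempt, now waiting
       val waiting     = Stage(3, Set.empty)
       val runningStages = Set(running)
       println(shouldKill(running, runningStages))     // true: actively running
       println(shouldKill(resubmitted, runningStages)) // true: prior failed attempt
       println(shouldKill(waiting, runningStages))     // false: clean waiting stage
     }
   }
   ```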



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

