yabola commented on code in PR #49270:
URL: https://github.com/apache/spark/pull/49270#discussion_r1896703317
##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2937,7 +2937,9 @@ private[spark] class DAGScheduler(
       } else {
         // This stage is only used by the job, so finish the stage if it is running.
         val stage = stageIdToStage(stageId)
-        if (runningStages.contains(stage)) {
+        val isRunningStage = runningStages.contains(stage) ||
+          (waitingStages.contains(stage) &&
+            taskScheduler.hasRunningTasks(stageId))
Review Comment:
> what if we just kill all waiting stages? Does taskScheduler.killAllTaskAttempts handle it well?
I tested this: if `killAllTaskAttempts` is called on the normally generated waiting stages, the stage status is displayed as `FAILED` (it was `SKIPPED` before); `killAllTaskAttempts` itself does not misbehave.
Always killing waiting stages seems to be the safer approach (their tasks shouldn't run anymore), but it may generate unnecessary `stageFailed` events compared with the previous behavior.
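For reference, a rough sketch (not this PR's final code) of what the always-kill variant could look like in the job-finish path; `killAllTaskAttempts`, `markStageAsFinished` and `shouldInterruptTaskThread` are existing calls, and the surrounding control flow is simplified:
```scala
// Sketch only: also kill waiting stages of the finished job, not just running ones.
val stage = stageIdToStage(stageId)
if (runningStages.contains(stage) || waitingStages.contains(stage)) {
  try {
    // Throws UnsupportedOperationException if the backend cannot kill tasks.
    taskScheduler.killAllTaskAttempts(
      stageId, shouldInterruptTaskThread(job), reason = "Job finished")
    // This is what turns a waiting stage's UI status from SKIPPED into FAILED.
    markStageAsFinished(stage, Some("Job finished"))
  } catch {
    case e: UnsupportedOperationException =>
      logWarning(s"Could not kill all tasks for stage $stageId", e)
  }
}
```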
> and can we add a special flag to indicate the waiting stages that are submitted due to retry?
Yes, we can add a flag; please see the updated code.
Actually, there is a trade-off here in deciding which waiting stages to kill (sketched below):
- always kill
- kill only stages that had failed before (`stage#failedAttemptIds` is non-empty)
- kill only stages that were resubmitted after a fetch failure (`stage#resubmitInFetchFailed`)
- kill only stages that currently have running tasks
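To make the trade-off concrete, here is a minimal sketch of the four options as predicates over a waiting stage. The `KillPolicy` names are made up for illustration; `stage.failedAttemptIds` is an existing field, while `resubmitInFetchFailed` and `taskScheduler.hasRunningTasks` are the additions proposed in this PR:
```scala
// Sketch only: which waiting stages of a finished job should be killed.
sealed trait KillPolicy
case object AlwaysKill extends KillPolicy
case object KillIfEverFailed extends KillPolicy
case object KillIfResubmittedOnFetchFailure extends KillPolicy
case object KillIfTasksRunning extends KillPolicy

def shouldKillWaitingStage(policy: KillPolicy, stage: Stage): Boolean = policy match {
  case AlwaysKill => true
  case KillIfEverFailed => stage.failedAttemptIds.nonEmpty            // the stage failed before
  case KillIfResubmittedOnFetchFailure => stage.resubmitInFetchFailed // flag proposed in this PR
  case KillIfTasksRunning => taskScheduler.hasRunningTasks(stage.id)  // method proposed in this PR
}
```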