wineternity commented on code in PR #38702: URL: https://github.com/apache/spark/pull/38702#discussion_r1035584393
########## core/src/main/scala/org/apache/spark/status/AppStatusListener.scala:
##########
@@ -645,8 +645,11 @@ private[spark] class AppStatusListener(
   }

   override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
-    // TODO: can this really happen?
-    if (event.taskInfo == null) {
+    // TODO: can a null taskInfo really happen?
+    // For tasks resubmitted because of an ExecutorLost, the SparkListenerTaskEnd is useless and
+    // would make the stage's activeTasks negative. The stage would then never be removed from
+    // liveStages, which in turn would keep the executor from being removed from deadExecutors.
+    if (event.taskInfo == null || event.reason == Resubmitted) {

Review Comment:
> Couple of things to clarify (the PR description/comments confused me a bit).
>
> a) A pair of task start and task end events was fired for a task (let us call it Tr). b) When the executor that ran Tr was lost while the stage was still running, a `Resubmitted` end event was fired for Tr. c) Subsequently, a new task start and task end will be fired for the retry of Tr.
>
> The issue here is that (b) is associated with (a), which was a success earlier but has now been marked as a failure.
>
> We should not be ignoring this event, but rather deal with it properly. For example, update counters, number of failed tasks, etc.

The steps are right, and in my opinion the task end event in (b) with the `Resubmitted` reason can be ignored in AppStatusListener; ignoring it also keeps the logic clear.

In (a), the counters and metrics are already recorded for Tr, e.g. stage.completedTasks, stage.job.completedTasks, stage.executorSummary.taskTime, and so on. In (b), we get an extra task end event with the `Resubmitted` reason for Tr, so Tr has one start event but two end events. The activeTasks counter in the stage then goes wrong, which is the root cause of this JIRA. In (c), the counters and metrics are recorded again for the retry of Tr.
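The effect of that extra end event can be sketched with a minimal simulation. This is not Spark's actual `AppStatusListener` (the class, `EndReason` values, and counters here are simplified stand-ins); it only illustrates why counting the `Resubmitted` end event drives `activeTasks` negative, and why the guard in this PR keeps it at zero:

```scala
// Minimal sketch, NOT Spark's real classes: simulates the start/end event
// sequence (a), (b), (c) from the discussion above.
object ResubmittedSketch {
  // Hypothetical stand-ins for Spark's TaskEndReason values.
  sealed trait EndReason
  case object Success extends EndReason
  case object Resubmitted extends EndReason

  class LiveStage {
    var activeTasks = 0
    var completedTasks = 0
    def onTaskStart(): Unit = activeTasks += 1
    // With the PR's guard enabled, Resubmitted end events are ignored entirely.
    def onTaskEnd(reason: EndReason, ignoreResubmitted: Boolean): Unit = {
      if (ignoreResubmitted && reason == Resubmitted) return
      activeTasks -= 1
      if (reason == Success) completedTasks += 1
    }
  }

  def run(ignoreResubmitted: Boolean): Int = {
    val stage = new LiveStage
    stage.onTaskStart()                             // (a) Tr starts
    stage.onTaskEnd(Success, ignoreResubmitted)     // (a) Tr succeeds
    stage.onTaskEnd(Resubmitted, ignoreResubmitted) // (b) executor lost: extra end for Tr
    stage.onTaskStart()                             // (c) retry of Tr starts
    stage.onTaskEnd(Success, ignoreResubmitted)     // (c) retry succeeds
    stage.activeTasks
  }

  def main(args: Array[String]): Unit = {
    // Without the guard, activeTasks ends at -1, so the stage never looks
    // finished; with the guard it ends at 0 and cleanup can proceed.
    println(run(ignoreResubmitted = false))
    println(run(ignoreResubmitted = true))
  }
}
```

Tracing it by hand: Tr contributes one start but two ends, so without the guard the stage ends with `activeTasks == -1` and is never considered finished, matching the leak described in the new code comment.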
For the metrics, if we handled the task end event in (b), it would be redundant and produce wrong values; for example, stage.executorSummary.taskTime would be added again, so we would need to exclude those updates in the `Resubmitted` case anyway.

For the counters, the one worth discussing is `failedTasks`. I think both of the following interpretations make sense:

1. (completedTasks, failedTasks, killedTasks) = (2, 0, 0), which means Tr succeeded twice with no failure. Resubmitted is really just a specific signal that a successful task is rerun; no task actually failed.
2. (completedTasks, failedTasks, killedTasks) = (2, 1, 0), which means Tr succeeded twice and we count the resubmission as a failure. But this would be conflated with tasks that really did fail, such as the tasks that were still running when the executor was lost; those are the truly failed ones.

Am I making sense?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org