wineternity commented on code in PR #38702: URL: https://github.com/apache/spark/pull/38702#discussion_r1035584393
########## core/src/main/scala/org/apache/spark/status/AppStatusListener.scala:
##########
@@ -645,8 +645,11 @@ private[spark] class AppStatusListener(
   }

   override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
-    // TODO: can this really happen?
-    if (event.taskInfo == null) {
+    // TODO: can a null taskInfo really happen?
+    // For tasks resubmitted because of an ExecutorLost, the SparkListenerTaskEnd is useless and
+    // would make the stage's activeTasks negative. The stage would then never be removed from
+    // liveStages, which in turn would keep the executor from being removed from deadExecutors.
+    if (event.taskInfo == null || event.reason == Resubmitted) {

Review Comment:
> Couple of things to clarify (the PR description/comments confused me a bit).
>
> a) A pair of task start and task end events was fired for a task (let us call it Tr). b) When the executor that ran Tr was lost while the stage was still running, a `Resubmitted` end event was fired for Tr. c) Subsequently, a new task start and task end will be fired for the retry of Tr.
>
> The issue here is that (b) is associated with (a), which was a success earlier but has now been marked as a failure.
>
> We should not be ignoring this event, but rather deal with it properly. For example, update counters, number of failed tasks, etc.

The steps are right, and in my opinion the task end event in (b) with the `Resubmitted` reason can be ignored in AppStatusListener; ignoring it also keeps the logic clear.

In (a), the counters and metrics are already recorded for Tr, e.g. stage.completedTasks, stage.job.completedTasks, stage.executorSummary.taskTime, and so on. In (b), we get an extra task end event with the `Resubmitted` reason for Tr, so Tr has one start event but two end events. The activeTasks counter in the stage then goes wrong, which is the root cause of this JIRA. In (c), the counters and metrics are recorded again for the retry of Tr.
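The effect of that extra end event can be sketched with a minimal simulation. This is not Spark's actual `AppStatusListener` (the class, `EndReason` values, and counters here are simplified stand-ins); it only illustrates why counting the `Resubmitted` end event drives `activeTasks` negative, and why the guard in this PR keeps it at zero:

```scala
// Minimal sketch, NOT Spark's real classes: simulates the start/end event
// sequence (a), (b), (c) from the discussion above.
object ResubmittedSketch {
  // Hypothetical stand-ins for Spark's TaskEndReason values.
  sealed trait EndReason
  case object Success extends EndReason
  case object Resubmitted extends EndReason

  class LiveStage {
    var activeTasks = 0
    var completedTasks = 0
    def onTaskStart(): Unit = activeTasks += 1
    // With the PR's guard enabled, Resubmitted end events are ignored entirely.
    def onTaskEnd(reason: EndReason, ignoreResubmitted: Boolean): Unit = {
      if (ignoreResubmitted && reason == Resubmitted) return
      activeTasks -= 1
      if (reason == Success) completedTasks += 1
    }
  }

  def run(ignoreResubmitted: Boolean): Int = {
    val stage = new LiveStage
    stage.onTaskStart()                             // (a) Tr starts
    stage.onTaskEnd(Success, ignoreResubmitted)     // (a) Tr succeeds
    stage.onTaskEnd(Resubmitted, ignoreResubmitted) // (b) executor lost: extra end for Tr
    stage.onTaskStart()                             // (c) retry of Tr starts
    stage.onTaskEnd(Success, ignoreResubmitted)     // (c) retry succeeds
    stage.activeTasks
  }

  def main(args: Array[String]): Unit = {
    // Without the guard, activeTasks ends at -1, so the stage never looks
    // finished; with the guard it ends at 0 and cleanup can proceed.
    println(run(ignoreResubmitted = false))
    println(run(ignoreResubmitted = true))
  }
}
```

Tracing it by hand: Tr contributes one start but two ends, so without the guard the stage ends with `activeTasks == -1` and is never considered finished, matching the leak described in the new code comment.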
For the metrics, if we handled the task end event in (b), it would be redundant and produce wrong values; for example, stage.executorSummary.taskTime would be added again, so we would need to exclude those updates in the `Resubmitted` case anyway.

For the counters, the one worth discussing is `failedTasks`. I think both of the following interpretations make sense:

1. (completedTasks, failedTasks, killedTasks) = (2, 0, 0), which means Tr succeeded twice with no failure. Resubmitted is really just a specific signal that a successful task is rerun; no task actually failed.
2. (completedTasks, failedTasks, killedTasks) = (2, 1, 0), which means Tr succeeded twice and we count the resubmission as a failure. But this would be conflated with tasks that really did fail, such as the tasks that were still running when the executor was lost; those are the truly failed ones.

Am I making sense?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org