Ngone51 commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener URL: https://github.com/apache/spark/pull/25943#discussion_r333091552
########## File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ##########

@@ -103,6 +104,81 @@ private[spark] class AppStatusListener(
     }
   }
 
+  // visible for tests
+  private[spark] def recoverLiveEntities(): Unit = {
+    if (!live) {
+      kvstore.view(classOf[JobDataWrapper])
+        .asScala.filter(_.info.status == JobExecutionStatus.RUNNING)
+        .map(_.toLiveJob).foreach(job => liveJobs.put(job.jobId, job))
+
+      kvstore.view(classOf[StageDataWrapper]).asScala
+        .filter { stageData =>
+          stageData.info.status == v1.StageStatus.PENDING ||
+            stageData.info.status == v1.StageStatus.ACTIVE
+        }
+        .map { stageData =>
+          val stageId = stageData.info.stageId
+          val jobs = liveJobs.values.filter(_.stageIds.contains(stageId)).toSeq
+          stageData.toLiveStage(jobs)
+        }.foreach { stage =>
+          val stageId = stage.info.stageId
+          val stageAttempt = stage.info.attemptNumber()
+          liveStages.put((stageId, stageAttempt), stage)
+
+          kvstore.view(classOf[ExecutorStageSummaryWrapper])
+            .index("stage")
+            .first(Array(stageId, stageAttempt))
+            .last(Array(stageId, stageAttempt))
+            .asScala
+            .map(_.toLiveExecutorStageSummary)
+            .foreach { esummary =>
+              stage.executorSummaries.put(esummary.executorId, esummary)
+              if (esummary.isBlacklisted) {
+                stage.blackListedExecutors += esummary.executorId
+                liveExecutors(esummary.executorId).isBlacklisted = true
+                liveExecutors(esummary.executorId).blacklistedInStages += stageId
+              }
+            }
+
+          kvstore.view(classOf[TaskDataWrapper])
+            .parent(Array(stageId, stageAttempt))
+            .index(TaskIndexNames.STATUS)
+            .first(TaskState.RUNNING.toString)
+            .last(TaskState.RUNNING.toString)
+            .closeableIterator().asScala
+            .map(_.toLiveTask)
+            .foreach { task =>
+              liveTasks.put(task.info.taskId, task)
+              stage.activeTasksPerExecutor(task.info.executorId) += 1
+            }
+          stage.savedTasks.addAndGet(kvstore.count(classOf[TaskDataWrapper]).intValue())
+        }
+      kvstore.view(classOf[ExecutorSummaryWrapper]).asScala.filter(_.info.isActive)

Review comment:
   > We may want to restore
   deadExecutors for the same. (isActive == false)

   Actually, we never write dead executors into the KVStore in a live `AppStatusListener`. In `AppStatusListener`, the only place where dead executors might be written into the KVStore is here:
   https://github.com/apache/spark/blob/d2f21b019909e66bf49ad764b851b4a65c2438f8/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L857-L869
   However, the `SparkListenerStageExecutorMetrics` event is only generated by `EventLoggingListener` and written into the event log file, and it is only consumed during the SHS's replay. That means `onStageExecutorMetrics` is only ever called in a non-live `AppStatusListener`. Likewise, a live `AppStatusListener` never has a chance to call `onStageExecutorMetrics`, and therefore never has a chance to write dead executors into the KVStore.

   ==================================================

   Wait, wait. I just remembered that in SPARK-28594 we'll do incremental replay on the SHS side, which makes it possible for dead executors to be written to the KVStore. Let me recover dead executors, too. I've decided to leave my original comment above, since I think this can be a tricky part and I want to make my thoughts as clear as possible.
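The recovery step being discussed can be summarized as: read the stored executor summaries back out of the KVStore and rebuild both the live and dead executor maps. Below is a minimal, self-contained sketch of that pattern. The `ExecutorSummary` case class and `recoverExecutors` method are illustrative stand-ins, not Spark's actual types; in the real listener the input would come from `kvstore.view(classOf[ExecutorSummaryWrapper]).asScala`.

```scala
import scala.collection.mutable

// Illustrative stand-in for Spark's executor summary data (hypothetical type).
case class ExecutorSummary(executorId: String, isActive: Boolean)

class RecoveringListener {
  val liveExecutors = mutable.HashMap[String, ExecutorSummary]()
  val deadExecutors = mutable.HashMap[String, ExecutorSummary]()

  // `stored` stands in for the summaries read back from the KVStore.
  def recoverExecutors(stored: Seq[ExecutorSummary]): Unit = {
    // Split stored summaries on isActive, mirroring the filter in the diff.
    val (active, inactive) = stored.partition(_.isActive)
    active.foreach(e => liveExecutors.put(e.executorId, e))
    // Per the discussion: with incremental replay (SPARK-28594), dead
    // executors can appear in the KVStore, so restore them as well.
    inactive.foreach(e => deadExecutors.put(e.executorId, e))
  }
}
```

The point of partitioning rather than filtering only on `isActive` is exactly the reviewer's note: once dead executors can reach the KVStore, the `isActive == false` entries must be restored into `deadExecutors` instead of being dropped.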