wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r453856811



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -170,13 +170,34 @@ private[spark] class DAGScheduler(
    */
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
-  // For tracking failed nodes, we use the MapOutputTracker's epoch number, 
which is sent with
-  // every task. When we detect a node failing, we note the current epoch 
number and failed
-  // executor, increment it for new tasks, and use this to ignore stray 
ShuffleMapTask results.
-  //
-  // TODO: Garbage collect information about failure epochs when we know there 
are no more
-  //       stray messages to detect.
-  private val failedEpoch = new HashMap[String, Long]
+  /**
+   * Tracks the latest epoch of a fully processed error related to the given 
executor. (We use
+   * the MapOutputTracker's epoch number, which is sent with every task.)
+   *
+   * When an executor fails, it can affect the results of many tasks, and we 
have to deal with
+   * all of them consistently. We don't simply ignore all future results from 
that executor,
+   * as the failures may have been transient; but we also don't want to 
"overreact" to follow-
+   * on errors we receive. Furthermore, we might receive notification of a 
task success, after
+   * we find out the executor has actually failed; we'll assume those 
successes are, in fact,
+   * simply delayed notifications and the results have been lost, if the tasks 
started in the
+   * same or an earlier epoch. In particular, we use this to control when we 
tell the
+   * BlockManagerMaster that the BlockManager has been lost.
+   */
+  private val executorFailureEpoch = new HashMap[String, Long]
+
+  /**
+   * Tracks the latest epoch of a fully processed error where shuffle files 
have been lost from
+   * the given executor.
+   *
+   * This is closely related to executorFailureEpoch.
+   * They only differ for the executor when there is a Standalone worker or an 
external shuffle

Review comment:
       Clarified the comment. It was not quite right. Thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to