[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

GitBox Fri, 26 Jun 2020 16:05:40 -0700


wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446445843




##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -170,13 +170,29 @@ private[spark] class DAGScheduler(
    */
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
-  // For tracking failed nodes, we use the MapOutputTracker's epoch number, 
which is sent with
-  // every task. When we detect a node failing, we note the current epoch 
number and failed
-  // executor, increment it for new tasks, and use this to ignore stray 
ShuffleMapTask results.
+  // Tracks the latest epoch of a fully processed error related to the given 
executor. (We use
+  // the MapOutputTracker's epoch number, which is sent with every task.)
   //
-  // TODO: Garbage collect information about failure epochs when we know there 
are no more
-  //       stray messages to detect.
-  private val failedEpoch = new HashMap[String, Long]
+  // When an executor fails, it can affect the results of many tasks, and we 
have to deal with
+  // all of them consistently. We don't simply ignore all future results from 
that executor,
+  // as the failures may have been transient; but we also don't want to 
"overreact" to follow-
+  // on errors we receive. Furthermore, we might receive notification of a 
task success, after
+  // we find out the executor has actually failed; we'll assume those 
successes are, in fact,
+  // simply delayed notifications and the results have been lost, if they come 
from the same
+  // epoch. In particular, we use this to control when we tell the 
BlockManagerMaster that the
+  // BlockManager has been lost.
+  private val executorFailureEpoch = new HashMap[String, Long]
+  // Tracks the latest epoch of a fully processed error where shuffle files 
have been lost from

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Reply via email to