squito commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r444508358
##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -177,6 +177,8 @@ private[spark] class DAGScheduler(
// TODO: Garbage collect information about failure epochs when we know there
//       are no more stray messages to detect.
private val failedEpoch = new HashMap[String, Long]
+ // In addition, track epoch for failed executors that result in lost file output
Review comment:
Yeah, I really agree with this. I think we need to rename both of these
epochs; I think I finally get why they both matter. One tracks files owned by
the external shuffle service -- so you don't invalidate those files when you
learn that an executor is lost. The other tracks files owned by the executor
itself -- that always includes memory-cached blocks, and also shuffle files
when there is no external shuffle service.
Once I wrapped my head around that part, the change in the logic made sense.
I'll think some more about better names and comments that would help make
that distinction clearer.
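To make the two-epoch distinction concrete, here is a minimal sketch of how the pair of maps could behave. The names (`executorFailureEpoch`, `shuffleFileLostEpoch`) and the `handleExecutorLost` helper are illustrative assumptions, not the names or logic actually adopted in the PR:

```scala
import scala.collection.mutable.HashMap

// Hypothetical renamed epoch maps (illustrative names, not from the PR):
// - executorFailureEpoch: tracks files owned by the executor itself, i.e.
//   memory-cached blocks always, plus shuffle files when there is no
//   external shuffle service.
// - shuffleFileLostEpoch: tracks shuffle files owned by the external shuffle
//   service, which survive an executor loss as long as the host is still up.
val executorFailureEpoch = new HashMap[String, Long]
val shuffleFileLostEpoch = new HashMap[String, Long]

var currentEpoch = 0L

// Sketch of the handler: losing an executor always bumps its failure epoch,
// but the shuffle-file epoch only moves when the files are actually gone
// (no external shuffle service, or the whole host was lost).
def handleExecutorLost(execId: String,
                       hostLost: Boolean,
                       externalShuffle: Boolean): Unit = {
  currentEpoch += 1
  executorFailureEpoch(execId) = currentEpoch
  if (!externalShuffle || hostLost) {
    shuffleFileLostEpoch(execId) = currentEpoch
  }
}
```

With an external shuffle service and a healthy host, only the executor-owned epoch advances, so the service-owned shuffle files are not marked lost.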
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]