squito commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r444508358
##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -177,6 +177,8 @@ private[spark] class DAGScheduler(
// TODO: Garbage collect information about failure epochs when we know there
//       are no more stray messages to detect.
private val failedEpoch = new HashMap[String, Long]
+ // In addition, track epoch for failed executors that result in lost file output
Review comment:
Yeah, I really agree with this. I think we need to rename both of these
epochs; I think I finally get why they both matter. One tracks files owned by
the external shuffle service -- so you don't invalidate those files when you
learn that an executor is lost. The other tracks files owned by the executor
itself -- that always includes memory-cached blocks, and also shuffle files
when there is no external shuffle service.
Once I wrapped my head around that part, the change in the logic made sense.
I'll think some more about better names and comments that would help make
that distinction clearer.
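To make the two-epoch distinction concrete, here is a minimal sketch of how the pair of maps could behave. The names (`executorFailureEpoch`, `shuffleFileLostEpoch`) and the `handleExecutorLost` helper are illustrative assumptions, not the names or logic actually adopted in the PR:

```scala
import scala.collection.mutable.HashMap

// Hypothetical renamed epoch maps (illustrative names, not from the PR):
// - executorFailureEpoch: tracks files owned by the executor itself, i.e.
//   memory-cached blocks always, plus shuffle files when there is no
//   external shuffle service.
// - shuffleFileLostEpoch: tracks shuffle files owned by the external shuffle
//   service, which survive an executor loss as long as the host is still up.
val executorFailureEpoch = new HashMap[String, Long]
val shuffleFileLostEpoch = new HashMap[String, Long]

var currentEpoch = 0L

// Sketch of the handler: losing an executor always bumps its failure epoch,
// but the shuffle-file epoch only moves when the files are actually gone
// (no external shuffle service, or the whole host was lost).
def handleExecutorLost(execId: String,
                       hostLost: Boolean,
                       externalShuffle: Boolean): Unit = {
  currentEpoch += 1
  executorFailureEpoch(execId) = currentEpoch
  if (!externalShuffle || hostLost) {
    shuffleFileLostEpoch(execId) = currentEpoch
  }
}
```

With an external shuffle service and a healthy host, only the executor-owned epoch advances, so the service-owned shuffle files are not marked lost.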
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]