squito commented on pull request #28848: URL: https://github.com/apache/spark/pull/28848#issuecomment-648856169
I've been thinking about this more, and in particular why there are two different epochs and what the meaning of both is. I tried out another change -- I just moved the `failedEpoch(execId) = currentEpoch` in `[removeExecutorAndUnregisterOutputs](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1943)` inside the `if(fileLost)`. With that change, all tests in `org.apache.spark.scheduler` pass, including your added test (and I confirmed that your test does indeed fail with the current code in master). The idea is this: the purpose of the epochs is to know what to do when you get a fetch failure. Epochs tell you if you've already dealt with an "equivalent" failure or not, and whether or not to consider more shuffle output as being lost. Without the shuffle service, you remove shuffle output when there is an executor failure OR there is a fetch failure. With the shuffle service, you only do it when there is a fetch failure. The code already has those two branches, controlled by `fileLost`. But, I think there is actually a problem there. I think I'm mostly repeating stuff @wypoon and @attilapiros said before -- I hope that restating it might help others out. but it also pointed out to me a lack of test coverage we have. Particularly that `blockManagerMaster.removeExecutor()` is only called once for multiple fetch failures (or a mix of executorLost + fetch failures). I'm thinking we should rename the epochs to `blockManagerFailureEpoch` and `externalShuffleFailureEpoch` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
