squito commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648856169


   I've been thinking about this more, and in particular why there are two 
different epochs and what the meaning of both is.   I tried out another change 
-- I just moved the `failedEpoch(execId) = currentEpoch` in 
`[removeExecutorAndUnregisterOutputs](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1943)`
 inside the `if(fileLost)`.  With that change, all tests in 
`org.apache.spark.scheduler` pass, including your added test (and I confirmed 
that your test does indeed fail with the current code in master).
   
   The idea is this:
   the purpose of the epochs is to know what to do when you get a fetch 
failure.  Epochs tell you if you've already dealt with an "equivalent" failure 
or not, and whether or not to consider more shuffle output as being lost.  
Without the shuffle service, you remove shuffle output when there is an 
executor failure OR there is a fetch failure.  With the shuffle service, you 
only do it when there is a fetch failure.  The code already has those two 
branches, controlled by `fileLost`.
   
   But, I think there is actually a problem there.  
   
   I think I'm mostly repeating stuff @wypoon and @attilapiros said before -- I 
hope that restating it might help others out.  but it also pointed out to me a 
lack of test coverage we have.  Particularly that 
`blockManagerMaster.removeExecutor()` is only called once for multiple fetch 
failures (or a mix of executorLost + fetch failures).
   
   I'm thinking we should rename the epochs to `blockManagerFailureEpoch` and 
`externalShuffleFailureEpoch`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to