wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647766743


   > Ideally if the node is lost, we should have `SlaveLost` as the executor loss reason, could you explain more about why doesn't this happen in your use case? Thank you!
   
   Only Spark Standalone uses `SlaveLost` with `workerLost=true`. Neither the YARN nor the Mesos backend uses that. Our customer uses Spark on YARN with the external shuffle service enabled. When the executor is lost, `DAGScheduler#handleExecutorLost` is called with `workerLost=false`. Since
   ```
       // if the cluster manager explicitly tells us that the entire worker was lost, then
       // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
       // from a Standalone cluster, where the shuffle service lives in the Worker.)
       val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
   ```
   (note the comment there referring to Spark Standalone), `removeExecutorAndUnregisterOutputs` is then called with `fileLost=false`.
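   Tracing that line for the customer's setup (the values below are hypothetical, just to show how the boolean comes out):
   ```
       // YARN never reports workerLost = true (see the snippet below), and the
       // customer runs with the external shuffle service enabled, so:
       val workerLost = false
       val externalShuffleServiceEnabled = true
       val fileLost = workerLost || !externalShuffleServiceEnabled   // false
       // => removeExecutorAndUnregisterOutputs is called with fileLost = false and
       //    the lost executor's shuffle output stays registered.
   ```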
   `DAGScheduler#handleExecutorLost` is called from 
   ```
       case ExecutorLost(execId, reason) =>
         val workerLost = reason match {
           case SlaveLost(_, true) => true
           case _ => false
         }
         dagScheduler.handleExecutorLost(execId, workerLost)
   ```
   As I said, only Spark Standalone creates `SlaveLost(_, true)` instances.
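   For illustration, a self-contained sketch of why that is (the `SlaveLost` signature here is paraphrased, not copied from the Spark sources):
   ```
       // Only a Standalone-style SlaveLost(_, workerLost = true) can ever make the
       // match above return true; everything else, including what YARN reports,
       // leaves workerLost at its default of false.
       object SlaveLostDemo extends App {
         case class SlaveLost(message: String = "Slave lost", workerLost: Boolean = false)

         // Hypothetical message; the point is that workerLost keeps its default.
         val reason = SlaveLost("Container marked as failed")

         val workerLost = reason match {
           case SlaveLost(_, true) => true
           case _ => false
         }
         println(workerLost) // false, so handleExecutorLost receives workerLost = false
       }
   ```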

