Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17088
  
    One thing I noticed while making sense of what was going on in the code (even before) -- IIRC, Spark standalone is a bit of a special case.  I think it used to be the case that to run multiple executors per node, you had to run multiple worker instances on the node.  E.g., see the mentions of SPARK_WORKER_INSTANCES here: http://spark.apache.org/docs/1.4.0/spark-standalone.html
    which is gone in the latest docs: http://spark.apache.org/docs/latest/spark-standalone.html
    
    but even though it's not documented, I think you can in fact still run multiple worker instances per node:
    https://github.com/apache/spark/blob/master/sbin/start-slave.sh#L88
    
    which means that when we get the `WorkerLost` message in Spark standalone, we aren't really sure whether all shuffle files on that host have been lost or not:
    
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1694
    
    But I *think* the consistent thing to do would be to assume that there is just one worker per node, since that is the recommended configuration these days, and go ahead and remove all shuffle files on the node if the external shuffle service is enabled.  That would also mean changing the handling of `ExecutorLost` to pass along the host.
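    To make that concrete, here's a rough, self-contained sketch of the idea (the names here are made up for illustration -- they are not the real `DAGScheduler` / `MapOutputTrackerMaster` APIs): when the external shuffle service is enabled, losing an executor would unregister every map output on its host, not just the outputs owned by that executor.
    
    ```scala
    object ExecutorLostSketch {
      import scala.collection.mutable
    
      // (shuffleId, mapId) -> (executorId, host) for registered map outputs
      private val mapOutputs = mutable.Map[(Int, Int), (String, String)]()
    
      def registerMapOutput(shuffleId: Int, mapId: Int, execId: String, host: String): Unit =
        mapOutputs((shuffleId, mapId)) = (execId, host)
    
      // Hypothetical handler: with the external shuffle service on, assume one worker
      // per node and drop every output on the lost executor's host; otherwise only
      // drop the outputs owned by that executor.
      def handleExecutorLost(execId: String, host: String, externalShuffleServiceEnabled: Boolean): Unit = {
        if (externalShuffleServiceEnabled) {
          mapOutputs.retain { case (_, (_, h)) => h != host }
        } else {
          mapOutputs.retain { case (_, (e, _)) => e != execId }
        }
      }
    
      def main(args: Array[String]): Unit = {
        registerMapOutput(0, mapId = 0, execId = "exec-1", host = "hostA")
        registerMapOutput(0, mapId = 1, execId = "exec-2", host = "hostA")
        registerMapOutput(0, mapId = 2, execId = "exec-3", host = "hostB")
    
        // Losing exec-1 with the shuffle service on wipes everything on hostA.
        handleExecutorLost("exec-1", "hostA", externalShuffleServiceEnabled = true)
        println(mapOutputs.keySet)  // Set((0,2))
      }
    }
    ```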


