GitHub user squito commented on the issue:
https://github.com/apache/spark/pull/17088
One thing I noticed while making sense of what was going on in the code
(even before this change) -- IIRC, Spark standalone is a bit of a special case. I think it
used to be the case that to run multiple executors per node, you had to run
multiple worker instances on that node. E.g., see the mentions of
SPARK_WORKER_INSTANCES here:
http://spark.apache.org/docs/1.4.0/spark-standalone.html
which are gone in the latest docs:
http://spark.apache.org/docs/latest/spark-standalone.html
Though it's not documented, I think you can in fact still run multiple
worker instances per node:
https://github.com/apache/spark/blob/master/sbin/start-slave.sh#L88
which means that when we get the WorkerLost msg in Spark standalone, we
aren't really sure whether all shuffle files on that host have been lost:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1694
But I *think* the consistent thing to do would be to assume there is
just one worker per node, since that is the currently recommended configuration, and
go ahead and remove all shuffle files on the node if the external shuffle
service is enabled. That would mean we'd also want to change the handling of
`ExecutorLost` to pass along the host.
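To make the difference concrete, here is a minimal, self-contained sketch of executor-level vs. host-level cleanup of registered map outputs. All of the names (`SimpleMapOutputTracker`, `MapOutputLocation`, `removeOutputsOnHost`, etc.) are stand-ins invented for illustration, not Spark's actual DAGScheduler/MapOutputTracker API; it only models the bookkeeping, not the RPC plumbing or the shuffle service itself.

```scala
// Illustrative stand-ins only -- not Spark's real DAGScheduler or MapOutputTrackerMaster.
case class MapOutputLocation(host: String, execId: String)

class SimpleMapOutputTracker {
  import scala.collection.mutable

  // shuffleId -> (mapId -> location of that map task's output)
  private val outputs = mutable.Map[Int, Map[Int, MapOutputLocation]]()

  def register(shuffleId: Int, mapId: Int, loc: MapOutputLocation): Unit =
    outputs(shuffleId) =
      outputs.getOrElse(shuffleId, Map.empty[Int, MapOutputLocation]) + (mapId -> loc)

  // Executor-level cleanup: forget only the outputs written by one executor.
  def removeOutputsOnExecutor(execId: String): Unit =
    for (id <- outputs.keys.toList)
      outputs(id) = outputs(id).filter { case (_, loc) => loc.execId != execId }

  // Host-level cleanup: if the external shuffle service on a host is gone, every
  // output registered on that host is unreachable, regardless of which worker or
  // executor wrote it.
  def removeOutputsOnHost(host: String): Unit =
    for (id <- outputs.keys.toList)
      outputs(id) = outputs(id).filter { case (_, loc) => loc.host != host }

  def locations(shuffleId: Int): Map[Int, MapOutputLocation] =
    outputs.getOrElse(shuffleId, Map.empty[Int, MapOutputLocation])
}

object HostLossDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new SimpleMapOutputTracker
    // Two workers on host1, each running one executor, plus one executor on host2.
    tracker.register(0, 0, MapOutputLocation("host1", "exec-1"))
    tracker.register(0, 1, MapOutputLocation("host1", "exec-2"))
    tracker.register(0, 2, MapOutputLocation("host2", "exec-3"))

    // Losing the worker that ran exec-1 and unregistering only its outputs still
    // leaves mapId 1 registered on host1, even though host1's shuffle files may
    // be gone along with the worker.
    tracker.removeOutputsOnExecutor("exec-1")
    println(tracker.locations(0))

    // Host-level removal reflects the one-worker-per-node assumption: everything
    // on host1 is unregistered and will be recomputed.
    tracker.removeOutputsOnHost("host1")
    println(tracker.locations(0))
  }
}
```

The point of the sketch is just that per-executor removal can leave stale entries for a host whose shuffle service has disappeared, while removal keyed by host (which is why `ExecutorLost` would need to carry the host) cleans up everything served from that node.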