Sun Rui created SPARK-17519:
-------------------------------
Summary: [MESOS] Enhance robustness when ExternalShuffleService is broken
Key: SPARK-17519
URL: https://issues.apache.org/jira/browse/SPARK-17519
Project: Spark
Issue Type: Improvement
Components: Mesos
Affects Versions: 2.0.0
Reporter: Sun Rui
This is intended to complement SPARK-17370, which addressed Standalone mode
only.
For Mesos, it seems we could enhance MesosExternalShuffleClient to detect, when
sending heartbeats, whether any of the external shuffle services has been lost.
In that case, MesosCoarseGrainedSchedulerBackend can report ExecutorLost with
workerLost=true. It can also add the slave on which that external shuffle
service runs to the blacklist, preventing further tasks from being launched on
it.
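
A rough sketch of the idea, not actual Spark code: every type and method below
(BackendStub, ShuffleServiceMonitor, recordHeartbeat, onShuffleServiceLost, the
maxFailures threshold) is hypothetical and only illustrates the proposed flow.
It also assumes the client can tell whether a heartbeat to a given shuffle
service succeeded, which may itself require changes to how heartbeats are sent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the reaction proposed for
// MesosCoarseGrainedSchedulerBackend; not the real class.
class BackendStub {
  final Map<String, List<String>> executorsByHost = new HashMap<>();
  final Set<String> blacklistedHosts = new HashSet<>();
  final List<String> lostEvents = new ArrayList<>();

  void addExecutor(String host, String executorId) {
    executorsByHost.computeIfAbsent(host, h -> new ArrayList<>()).add(executorId);
  }

  // Blacklist the host and report each of its executors as lost
  // with workerLost=true.
  void onShuffleServiceLost(String host) {
    blacklistedHosts.add(host);
    List<String> executors = executorsByHost.remove(host);
    if (executors == null) {
      return;
    }
    for (String id : executors) {
      lostEvents.add(id + " workerLost=true");
    }
  }

  boolean canLaunchOn(String host) {
    return !blacklistedHosts.contains(host);
  }
}

// Hypothetical heartbeat bookkeeping that a client such as
// MesosExternalShuffleClient might keep per shuffle service host.
class ShuffleServiceMonitor {
  private final int maxFailures;
  private final BackendStub backend;
  private final Map<String, Integer> consecutiveFailures = new HashMap<>();
  private final Set<String> lost = new HashSet<>();

  ShuffleServiceMonitor(int maxFailures, BackendStub backend) {
    this.maxFailures = maxFailures;
    this.backend = backend;
  }

  // Record the outcome of one heartbeat; after maxFailures consecutive
  // failures the service is declared lost and the backend is notified once.
  synchronized void recordHeartbeat(String host, boolean succeeded) {
    if (lost.contains(host)) {
      return;
    }
    if (succeeded) {
      consecutiveFailures.put(host, 0);
      return;
    }
    int failures = consecutiveFailures.merge(host, 1, Integer::sum);
    if (failures >= maxFailures) {
      lost.add(host);
      backend.onShuffleServiceLost(host);
    }
  }
}
```

Requiring several consecutive failures rather than a single one is a design
choice made here to avoid blacklisting a slave on a transient network error;
the right threshold, and whether a blacklisted slave should ever be retried,
is left open.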