Timur Abakumov created SPARK-19755:
--------------------------------------
Summary: Blacklist is always active for
MesosCoarseGrainedSchedulerBackend. As a result, the scheduler cannot create an
executor after some time.
Key: SPARK-19755
URL: https://issues.apache.org/jira/browse/SPARK-19755
Project: Spark
Issue Type: Bug
Components: Mesos, Scheduler
Affects Versions: 2.1.0
Environment: mesos, marathon, docker - driver and executors are
dockerized.
Reporter: Timur Abakumov
When a task fails for any reason, MesosCoarseGrainedSchedulerBackend increments
the failure counter for the slave where that task was running.
When the counter is >= 2 (MAX_SLAVE_FAILURES), that Mesos slave is excluded.
Over time the scheduler cannot create a new executor because every slave is in
the blacklist. A task failure is not necessarily related to host health,
especially for long-running streaming apps.
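
To make the failure mode concrete, here is a minimal sketch of the counting
behavior described above. It is a simplified model, not the actual Spark
source: the class name SlaveFailureTracker and its methods are hypothetical,
and the assumption that the count is never reset is inferred from the "over
time" behavior reported here.

```scala
import scala.collection.mutable

// Simplified, hypothetical model of the behavior in this report.
// It is NOT the real MesosCoarseGrainedSchedulerBackend code.
class SlaveFailureTracker {
  // Threshold quoted in the report (MAX_SLAVE_FAILURES).
  val MaxSlaveFailures = 2

  // Failure count per Mesos slave ID. Assumption: it is never reset.
  private val taskFailures =
    mutable.Map.empty[String, Int].withDefaultValue(0)

  // Called whenever a task on the given slave fails, for any reason.
  def recordTaskFailure(slaveId: String): Unit =
    taskFailures(slaveId) += 1

  // A slave is usable only while it is below the failure threshold.
  def canLaunchOn(slaveId: String): Boolean =
    taskFailures(slaveId) < MaxSlaveFailures
}

object SlaveFailureTrackerDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new SlaveFailureTracker
    val slaves = Seq("slave-1", "slave-2")
    // Two unrelated task failures per slave over the life of a long job.
    for (s <- slaves; _ <- 1 to 2) tracker.recordTaskFailure(s)
    // No slave is left on which an executor could be launched.
    println(slaves.exists(tracker.canLaunchOn))
  }
}
```

In this model, a long-running application only needs two failures per slave,
spread over any length of time, before no slave remains eligible.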
If accepted as a bug, a possible solution is to use spark.blacklist.enabled to
make that functionality optional and, if it makes sense, to make
MAX_SLAVE_FAILURES configurable as well.
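
A rough sketch of what the proposed change could look like is below. It is
only an illustration under stated assumptions: the configuration is modeled as
a plain Map rather than SparkConf, BlacklistSettings and LaunchPolicy are
hypothetical names, and the key spark.mesos.maxSlaveFailures is invented for
this example (only spark.blacklist.enabled is named in this report).

```scala
// Hypothetical sketch of the proposed fix; not actual Spark code.
final case class BlacklistSettings(enabled: Boolean, maxSlaveFailures: Int)

object BlacklistSettings {
  // Default mirrors the value of MAX_SLAVE_FAILURES quoted in the report.
  val DefaultMaxSlaveFailures = 2

  // `spark.blacklist.enabled` is the key named in the report.
  // `spark.mesos.maxSlaveFailures` is an invented key, for illustration only.
  def fromConf(conf: Map[String, String]): BlacklistSettings =
    BlacklistSettings(
      enabled = conf.get("spark.blacklist.enabled").exists(_.toBoolean),
      maxSlaveFailures = conf
        .get("spark.mesos.maxSlaveFailures")
        .map(_.toInt)
        .getOrElse(DefaultMaxSlaveFailures)
    )
}

object LaunchPolicy {
  // With blacklisting disabled, the failure count never blocks a slave.
  def canLaunchOn(failures: Int, settings: BlacklistSettings): Boolean =
    !settings.enabled || failures < settings.maxSlaveFailures
}
```

With this shape, a streaming job could either leave blacklisting off or raise
the threshold, instead of being bound to a fixed limit of two failures.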
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)