Timur Abakumov created SPARK-19755:
--------------------------------------

             Summary: Blacklist is always active for 
MesosCoarseGrainedSchedulerBackend. As a result, the scheduler cannot create an 
executor after some time.
                 Key: SPARK-19755
                 URL: https://issues.apache.org/jira/browse/SPARK-19755
             Project: Spark
          Issue Type: Bug
          Components: Mesos, Scheduler
    Affects Versions: 2.1.0
         Environment: mesos, marathon, docker - driver and executors are 
dockerized.
            Reporter: Timur Abakumov


When a task fails for any reason, MesosCoarseGrainedSchedulerBackend increments 
a failure counter for the slave where that task was running.
When the counter is >= 2 (MAX_SLAVE_FAILURES), the Mesos slave is excluded.
Over time the scheduler cannot create a new executor, because every slave is in 
the blacklist. A task failure is not necessarily related to host health, 
especially for long-running streaming apps.
If accepted as a bug: a possible solution is to use spark.blacklist.enabled to 
make that functionality optional and, if it makes sense, MAX_SLAVE_FAILURES 
could also be made configurable.
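
To illustrate the behaviour described above, here is a minimal toy model in 
Python. It is not the Spark source: the class SlaveFailureTracker, its methods, 
and the slave ids are hypothetical names made up for this sketch, and it assumes 
that failure counts are never reset or expired. Only the threshold of 2 
(MAX_SLAVE_FAILURES) and the idea of gating the check behind a flag such as 
spark.blacklist.enabled come from the report.

```python
from collections import defaultdict

MAX_SLAVE_FAILURES = 2  # threshold quoted in the report


class SlaveFailureTracker:
    """Toy model of per-slave failure counting (hypothetical, not Spark code)."""

    def __init__(self, blacklist_enabled=True, max_failures=MAX_SLAVE_FAILURES):
        # blacklist_enabled stands in for the proposed spark.blacklist.enabled gate.
        self.blacklist_enabled = blacklist_enabled
        # max_failures stands in for a configurable MAX_SLAVE_FAILURES.
        self.max_failures = max_failures
        # Assumption of this sketch: counts only grow and are never reset.
        self.failures = defaultdict(int)

    def record_task_failure(self, slave_id):
        """Count one failed task against the slave it ran on."""
        self.failures[slave_id] += 1

    def is_usable(self, slave_id):
        """A slave is excluded once its count reaches max_failures."""
        if not self.blacklist_enabled:
            return True
        return self.failures[slave_id] < self.max_failures

    def usable_slaves(self, slave_ids):
        """Return the slaves on which a new executor could still be launched."""
        return [s for s in slave_ids if self.is_usable(s)]


slaves = ["slave-1", "slave-2", "slave-3"]

always_on = SlaveFailureTracker()                       # behaviour as reported
optional = SlaveFailureTracker(blacklist_enabled=False)  # proposed opt-out

# A long-running app eventually sees two task failures on every slave.
for tracker in (always_on, optional):
    for slave in slaves:
        for _ in range(MAX_SLAVE_FAILURES):
            tracker.record_task_failure(slave)

print(always_on.usable_slaves(slaves))  # no slave left for a new executor
print(optional.usable_slaves(slaves))   # all slaves remain usable
```

In this model, two unrelated task failures per slave are enough to leave the 
always-on tracker with no usable slave, while the tracker with the check 
disabled can still place executors everywhere.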



