Looking at the code a bit more, it appears that blacklisting is disabled by default. To enable it, set spark.blacklist.enabled=true.
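A minimal sketch of what that could look like in spark-defaults.conf (the values below are illustrative examples, not recommendations; check the configuration docs for the actual defaults):

```properties
# Turn blacklisting on (it is off by default):
spark.blacklist.enabled                          true
# Example value: failures of a task on one executor before that executor
# is blacklisted for that task:
spark.blacklist.task.maxTaskAttemptsPerExecutor  1
# Example value: how long an executor stays blacklisted before it is
# put back into consideration:
spark.blacklist.timeout                          1h
```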
The updates in 2.1.0 appear to provide much more fine-grained settings for this, like the number of times a task can fail before an executor is blacklisted for a stage. In that version, you probably want to set spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs <http://spark.apache.org/docs/latest/configuration.html> and search for “blacklist” to see all the options.

rb

On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue <rb...@netflix.com> wrote:

> Chawla,
>
> We hit this issue, too. I worked around it by setting
> spark.scheduler.executorTaskBlacklistTime=5000. The problem for us was
> that the scheduler was using locality to select the executor, even though
> it had already failed there. The executor task blacklist time controls how
> long the scheduler will avoid using an executor for a failed task, which
> causes it to avoid rescheduling on that executor. The default was 0, so the
> executor was put back into consideration immediately.
>
> In 2.1.0 that setting has changed to spark.blacklist.timeout. I’m not
> sure if it does exactly the same thing. The default for that setting is
> 1h instead of 0. It’s better to have a non-zero default to avoid what
> you’re seeing.
>
> rb
>
> On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
>> I am seeing a strange issue. I had a badly behaving slave that failed
>> the entire job. I have set spark.task.maxFailures to 8 for my job. It
>> seems that all task retries happen on the same slave in case of failure.
>> My expectation was that a task would be retried on a different slave
>> after a failure, since the chance of all 8 retries landing on the same
>> slave should be very low.
>>
>> Regards
>> Sumit Chawla
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
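The timeout behavior discussed in this thread can be sketched as a toy model. This is not Spark's actual implementation; the class and method names are invented for illustration. It just shows why a timeout of 0 (the old default) meant a failed executor was immediately eligible again for the same task:

```python
class ExecutorBlacklist:
    """Toy model of per-task executor blacklisting with a timeout."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        # Maps (task, executor) to the time of the last failure, in ms.
        self.failures = {}

    def record_failure(self, task, executor, now_ms):
        self.failures[(task, executor)] = now_ms

    def is_blacklisted(self, task, executor, now_ms):
        # The executor is avoided for this task only while the timeout
        # window since the last failure is still open.
        failed_at = self.failures.get((task, executor))
        if failed_at is None:
            return False
        return (now_ms - failed_at) < self.timeout_ms


# With a timeout of 0, the executor is reconsidered immediately after a
# failure, so locality can send every retry back to the same bad slave:
b0 = ExecutorBlacklist(timeout_ms=0)
b0.record_failure("task1", "exec1", now_ms=1000)
print(b0.is_blacklisted("task1", "exec1", now_ms=1000))  # False

# With a 5000 ms timeout, the executor is avoided for a while:
b5 = ExecutorBlacklist(timeout_ms=5000)
b5.record_failure("task1", "exec1", now_ms=1000)
print(b5.is_blacklisted("task1", "exec1", now_ms=2000))  # True
print(b5.is_blacklisted("task1", "exec1", now_ms=7000))  # False
```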