Imran Rashid created SPARK-15865:
------------------------------------

             Summary: Blacklist should not result in job hanging with less than 4 executors
                 Key: SPARK-15865
                 URL: https://issues.apache.org/jira/browse/SPARK-15865
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 2.0.0
            Reporter: Imran Rashid
            Assignee: Imran Rashid


Currently, when you turn on blacklisting with 
{{spark.scheduler.executorTaskBlacklistTime}} but have fewer than 
{{spark.task.maxFailures}} executors, you can end up with a job "hung" after 
some task failures.

If some task fails regularly (say, due to an error in user code), then the task 
will be blacklisted from the given executor.  The scheduler will then try 
another executor, and the task will fail there as well.  However, after the 
task has been tried on all available executors, the scheduler will simply stop 
trying to schedule it anywhere.  Because there are fewer executors than 
{{spark.task.maxFailures}}, the task never accumulates enough failures to fail 
the job.  The job doesn't fail, nor does it succeed -- it simply waits.  
Eventually, when the blacklist expires, the task will be scheduled again.  But 
that can be quite far in the future, and in the meantime the user just observes 
a stuck job.
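
To make the failure mode concrete, here is a toy model in Python. It is not Spark scheduler code. The function name {{simulate}} and its return values are made up for illustration, and it assumes a task that fails on every attempt and a blacklist that does not expire during the run.

```python
def simulate(num_executors, max_failures):
    """Toy model of a task that always fails.

    After each failure the task is blacklisted on the executor it ran on.
    Returns "failed" if the failure count reaches max_failures, or "hung"
    if every executor is blacklisted for the task before that happens.
    """
    blacklisted = set()  # executors this task may no longer run on
    failures = 0
    while failures < max_failures:
        candidates = [e for e in range(num_executors) if e not in blacklisted]
        if not candidates:
            # Nothing can be scheduled until the blacklist expires.
            return "hung"
        executor = candidates[0]
        failures += 1
        blacklisted.add(executor)
    return "failed"
```

With 2 executors and a failure limit of 4, the model reports "hung". With 4 or more executors it reports "failed", which is the outcome a user would expect.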

Instead we should abort the stage (and fail any dependent jobs) as soon as we 
detect tasks that cannot be scheduled.
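
A rough sketch of what such a check could look like, written in Python for illustration only. The names {{unschedulable_tasks}}, {{abort_if_unschedulable}} and {{StageAborted}} are hypothetical and do not correspond to anything in the Spark code base. The blacklist is assumed to be a mapping from task id to the set of executor ids that task is blacklisted on.

```python
class StageAborted(Exception):
    """Hypothetical signal that a stage cannot make progress."""


def unschedulable_tasks(pending_tasks, executors, blacklist):
    """Return the pending tasks that are blacklisted on every live executor.

    With no live executors nothing is reported, since in that case the
    scheduler is waiting for resources rather than blocked by the blacklist.
    """
    live = set(executors)
    return [t for t in pending_tasks
            if live and live <= blacklist.get(t, set())]


def abort_if_unschedulable(pending_tasks, executors, blacklist):
    """Raise StageAborted as soon as some pending task has nowhere to run."""
    stuck = unschedulable_tasks(pending_tasks, executors, blacklist)
    if stuck:
        raise StageAborted(f"tasks blacklisted on all executors: {stuck}")
```

The point of the sketch is only that the condition is cheap to evaluate at scheduling time, so the stage can be failed immediately instead of waiting for the blacklist to expire.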



