[
https://issues.apache.org/jira/browse/SPARK-15865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-15865:
------------------------------------
Assignee: Apache Spark (was: Imran Rashid)
> Blacklist should not result in job hanging with less than 4 executors
> ---------------------------------------------------------------------
>
> Key: SPARK-15865
> URL: https://issues.apache.org/jira/browse/SPARK-15865
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.0.0
> Reporter: Imran Rashid
> Assignee: Apache Spark
>
> Currently, when you turn on blacklisting with
> {{spark.scheduler.executorTaskBlacklistTime}} but have fewer than
> {{spark.task.maxFailures}} executors, you can end up with a job "hung" after
> some task failures.
> If some task fails regularly (say, due to an error in user code), then the task
> will be blacklisted from the given executor. It will then try another
> executor, and fail there as well. However, after it has tried all available
> executors, the scheduler will simply stop trying to schedule the task
> anywhere. The job doesn't fail, nor does it succeed -- it simply waits.
> Eventually, when the blacklist expires, the task will be scheduled again.
> But that can be quite far in the future, and in the meantime the user just
> observes a stuck job.
> Instead we should abort the stage (and fail any dependent jobs) as soon as we
> detect tasks that cannot be scheduled.
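The failure mode and the proposed fix can be sketched with a toy model. This is a hypothetical, simplified simulation of the scheduling behavior described above, not Spark's actual TaskSetManager code; the function and parameter names are illustrative assumptions.

```python
def schedule_task(executors, max_failures, run_task):
    """Toy model: try a task on each non-blacklisted executor.

    Current behavior leaves the task unscheduled once it is blacklisted
    everywhere; the proposed behavior aborts the stage immediately instead
    of waiting for the blacklist to expire.
    """
    blacklisted = set()
    failures = 0
    while failures < max_failures:
        candidates = [e for e in executors if e not in blacklisted]
        if not candidates:
            # Proposed fix: fail fast rather than leave the job hanging
            # until the per-executor blacklist entries expire.
            raise RuntimeError(
                "Aborting stage: task blacklisted on all executors")
        executor = candidates[0]
        if run_task(executor):
            return executor
        # Per-executor blacklist: this executor won't be retried for
        # this task while the blacklist entry is active.
        blacklisted.add(executor)
        failures += 1
    raise RuntimeError("Task failed %d times, giving up" % failures)
```

With two executors and {{spark.task.maxFailures}} at its default of 4, a task that always fails exhausts both executors after only two attempts; under the proposed behavior the stage aborts at that point instead of waiting.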
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]