[
https://issues.apache.org/jira/browse/SPARK-15865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Imran Rashid resolved SPARK-15865.
----------------------------------
Resolution: Fixed
Fix Version/s: 2.1.0
Issue resolved by pull request 13603
[https://github.com/apache/spark/pull/13603]
> Blacklist should not result in job hanging with less than 4 executors
> ---------------------------------------------------------------------
>
> Key: SPARK-15865
> URL: https://issues.apache.org/jira/browse/SPARK-15865
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.0.0
> Reporter: Imran Rashid
> Assignee: Imran Rashid
> Fix For: 2.1.0
>
>
> Currently, when you turn on blacklisting with
> {{spark.scheduler.executorTaskBlacklistTime}} but have fewer than
> {{spark.task.maxFailures}} executors, you can end up with a job "hung" after
> some task failures.
> If some task fails repeatedly (say, due to an error in user code), then the
> task will be blacklisted from the given executor. It will then try another
> executor, and fail there as well. However, after it has tried all available
> executors, the scheduler will simply stop trying to schedule the task
> anywhere. The job doesn't fail, nor does it succeed -- it simply waits.
> Eventually, when the blacklist expires, the task will be scheduled again.
> But that can be quite far in the future, and in the meantime the user just
> observes a stuck job.
> Instead, we should abort the stage (and fail any dependent jobs) as soon as
> we detect tasks that cannot be scheduled.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]