[ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225375#comment-16225375
 ] 

Juan Rodríguez Hortalá commented on SPARK-22148:
------------------------------------------------

Hi [~irashid]. This looks like a different problem: this issue is about a job 
being aborted because there is nowhere to schedule a task, whereas 
SPARK-15815 is about a hang. However, I have seen hangs similar to the one 
described in SPARK-15815 in the past, also related to dynamic allocation, so 
the root causes could be related.

My proposal is similar to some of the ideas you outline in SPARK-15815. The 
main difference is that I don't suggest killing an executor, but requesting 
more executors from the resource manager. The result is similar, but your 
approach would work even if no more capacity is available. On the other hand, 
my approach won't kill an executor that is making progress on other tasks. 
However, my approach won't work if 1) there are no more executors available in 
the cluster, and 2) the executor idle timeout is very long, or executors are 
caching RDDs and the timeout is left at its default of infinity. This is 
because I was expecting to cover the case of no more capacity available by 
assuming an executor will eventually become idle. Killing an executor has no 
terrible consequences because with dynamic allocation we probably have the 
external shuffle service, so I think the approach you propose in SPARK-15815 
is a better alternative.

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22148
>                 URL: https://issues.apache.org/jira/browse/SPARK-22148
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Juan Rodríguez Hortalá
>         Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors from the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 
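
The retry idea in the description above could be sketched roughly as follows. 
This is only an illustration in plain Python, not Spark code: the function 
acquire_usable_executor, the JobAborted exception, the request_executor and 
is_blacklisted callbacks, and the max_attempts / wait_seconds parameters are 
all hypothetical names invented for this sketch. They do not correspond to 
existing Spark APIs or configuration keys.

```python
import time
from typing import Callable, Optional


class JobAborted(Exception):
    """Hypothetical error: no usable executor could be obtained."""


def acquire_usable_executor(
    request_executor: Callable[[], Optional[str]],
    is_blacklisted: Callable[[str], bool],
    max_attempts: int = 3,
    wait_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Ask the cluster manager for an executor, retrying a bounded number of times.

    request_executor returns an executor id, or None if the cluster manager
    could not provide one. is_blacklisted tells whether that executor is
    blacklisted for the pending task. Both are assumptions of this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        executor_id = request_executor()
        if executor_id is not None and not is_blacklisted(executor_id):
            return executor_id  # a suitable executor was provided
        if attempt < max_attempts:
            sleep(wait_seconds)  # configurable wait between request attempts
    # Same outcome as today: give up and abort the job.
    raise JobAborted(
        f"no usable executor after {max_attempts} request attempts"
    )
```

For example, if the first request returns nothing, the second returns an 
executor on a blacklisted node, and the third returns a usable executor, the 
function waits twice and returns the third executor. If every attempt fails, 
the job is aborted as in the current behavior.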
