[ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715719#comment-15715719
 ] 

Imran Rashid commented on SPARK-15815:
--------------------------------------

[~SuYan] I've been mulling this over for a while, and I think my earlier 
proposal is a good one.  We'd need two changes:

1. When unschedulability is detected, kill an executor that is blacklisted for 
the unschedulable task and request another one.
2. When we detect unschedulability due to blacklisting, instead of immediately 
aborting the taskset, we should start a countdown (say, 5 min).  If the 
taskset remains unschedulable when the countdown is up, then we abort the 
taskset.
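
The countdown in (2) could be sketched roughly like this (a minimal sketch; 
{{UnschedulableCountdown}} and its methods are illustrative names I'm making 
up here, not actual Spark scheduler internals):

```scala
// Minimal sketch of the proposed countdown; names are illustrative,
// not actual Spark scheduler internals.
object UnschedulableCountdown {
  val timeoutMs: Long = 5 * 60 * 1000L  // e.g. 5 minutes, as proposed

  // Start a countdown when unschedulability is first detected;
  // keep the existing deadline if one is already running.
  def onUnschedulable(nowMs: Long, deadline: Option[Long]): Option[Long] =
    deadline.orElse(Some(nowMs + timeoutMs))

  // Scheduling any task of the taskset clears the countdown.
  def onTaskScheduled(): Option[Long] = None

  // Abort only if a countdown is running and its deadline has passed.
  def shouldAbort(nowMs: Long, deadline: Option[Long]): Boolean =
    deadline.exists(nowMs >= _)
}
```

The key property is that any successful scheduling resets the state, so only 
a taskset that stays unschedulable for the whole window gets aborted.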

In the case you outline above, this should have the desired effect.  When 
dynamic allocation has you down to just one executor for the last task, and 
that executor gets blacklisted, you'd kill the executor and simultaneously 
start the countdown.  Hopefully the cluster manager gives you another executor 
before the countdown is up, and then your job continues happily.

Two other situations worth considering: (a) the cluster manager gives us 
another executor on a bad node.  Tasks fail on this new executor, which again 
gets blacklisted.  I think this is OK.  The countdown would get reset when we 
schedule the task on the new executor, even though the task will fail.  Then 
when the new executor gets blacklisted, we kill it and start the countdown 
again, just as before.

(b) the cluster manager fails to give you another executor before the 
countdown is up.  We could either abort the job, or just let the app hang 
indefinitely (eg., ignore the countdown in the specific case that there aren't 
any executors).  In fact, the [code already lets the app wait indefinitely if 
there are no 
executors|https://github.com/apache/spark/blob/48778976e0566d9c93a8c900825def82c6b81fd6/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L594].
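
Option (b)'s "hang instead of abort" behavior amounts to gating the countdown 
on the executor count.  Again just a sketch with made-up names, mirroring the 
existing wait-forever behavior linked above:

```scala
// Sketch of option (b): ignore the countdown while the app has zero
// executors, so an executor-less app keeps waiting rather than aborting.
// Names are hypothetical, not real Spark API.
object AbortPolicy {
  def abortOnTimeout(nowMs: Long, deadlineMs: Long, numExecutors: Int): Boolean =
    numExecutors > 0 && nowMs >= deadlineMs
}
```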

Note that SPARK-16554, automated killing of blacklisted executors, is related, 
but insufficient to handle (1) above.  SPARK-16554 will only kill an executor 
that is blacklisted for the entire application, but in this case we need to 
kill an executor that is blacklisted even for just one task.

An alternative to actively killing the executor would be to somehow inform 
the {{ExecutorAllocationManager}} that we have a task which *cannot* be 
scheduled on the existing executors, so that it requests a new executor while 
keeping the old one.  However, that makes the implementation significantly 
more complex.  Though it would be more efficient, I think we should keep 
things simpler and live with a bit of inefficiency in this case.

Thoughts?  Any interest in taking a stab at implementing this?

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -----------------------------------------------------------------
>
>                 Key: SPARK-15815
>                 URL: https://issues.apache.org/jira/browse/SPARK-15815
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.6.1
>            Reporter: SuYan
>            Priority: Minor
>
> Enable blacklisting with a blacklist time larger than 120s, and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume only 1 task is left running, on executor A, and all other 
> executors have timed out.
> 2. The task fails, so it will not be rescheduled on executor A due to the 
> blacklist time.
> 3. The ExecutorAllocationManager keeps requesting targetNumExecutors = 1. 
> Because we still have executor A, oldTargetNumExecutors == 
> targetNumExecutors == 1, so no more executors are ever added, even after 
> executor A times out.  The app endlessly requests delta = 0 executors.
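
The delta = 0 loop described in step 3 of the report can be illustrated like 
this (heavily simplified; the real {{ExecutorAllocationManager}} tracks much 
more state, and {{requestedDelta}} is a made-up name):

```scala
// Simplified illustration of the hang: the allocation target is derived
// from pending/running tasks and does not know the only live executor is
// blacklisted for that task, so the requested delta stays 0 forever.
object AllocationDelta {
  def requestedDelta(targetNumExecutors: Int, currentExecutors: Int): Int =
    math.max(targetNumExecutors - currentExecutors, 0)
}
```

With one pending task and blacklisted executor A still alive, the target and 
current counts are both 1, so no new executor is ever requested.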



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
