[ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397271#comment-15397271
 ] 

SuYan edited comment on SPARK-15815 at 7/28/16 8:50 AM:
--------------------------------------------------------

The current temporary workaround: once every executor has hit the 60s idle 
timeout, we reset numExecutorsTarget to 0 to break the hang. However, that 
reset may take a long time to happen if there is a long-tail task.

A second temporary workaround: even when oldTargetNumExecutor equals 
targetNumExecutor, still call client.requestTotalExecutors(numExecutorsTarget, 
localityAwareTasks, hostToLocalTaskCount). That way, once a fully blacklisted 
executor hits the 60s timeout, a replacement executor is requested.
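
The second workaround can be sketched with a toy model. This is not Spark 
code: ToyClusterManager, sync_target and run are hypothetical names, and the 
model assumes the cluster manager tops the executor count back up to the 
requested total whenever it receives a request.

```python
class ToyClusterManager:
    """Hypothetical cluster manager that tops up executors to the requested total."""

    def __init__(self):
        self.live_executors = set()
        self._next_id = 0

    def request_total_executors(self, total):
        # Launch new executors until the live count reaches the requested total.
        while len(self.live_executors) < total:
            self.live_executors.add("executor-%d" % self._next_id)
            self._next_id += 1

    def remove(self, executor_id):
        self.live_executors.discard(executor_id)


def sync_target(cm, old_target, new_target, always_resend):
    """Forward the target if it changed, or unconditionally under workaround 2."""
    if always_resend or new_target != old_target:
        cm.request_total_executors(new_target)


def run(always_resend):
    cm = ToyClusterManager()
    cm.request_total_executors(1)  # launches executor-0, playing Executor A
    cm.remove("executor-0")        # the blacklisted executor hits the idle timeout
    # The target is still 1, so old and new targets are equal.
    sync_target(cm, old_target=1, new_target=1, always_resend=always_resend)
    return cm.live_executors


without_fix = run(always_resend=False)  # no request is sent, nothing is launched
with_fix = run(always_resend=True)      # the repeated request launches a replacement
```

In this model the unchanged target is what suppresses the request, so 
resending it is enough to get a replacement once the old executor is gone.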



> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -----------------------------------------------------------------
>
>                 Key: SPARK-15815
>                 URL: https://issues.apache.org/jira/browse/SPARK-15815
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.6.1
>            Reporter: SuYan
>            Priority: Minor
>
> Enable BlacklistExecutor with a blacklist time larger than 120s, and enable 
> DynamicAllocate with minExecutors = 0.
> 1. Assume only 1 task is left, running on Executor A, and all other 
> executors have timed out.
> 2. The task fails, so it will not be scheduled on Executor A again because 
> of the blacklist time.
> 3. ExecutorAllocateManager always requests targetNumExecutor = 1 executor. 
> Because Executor A already exists, oldTargetNumExecutor == 
> targetNumExecutor == 1, so no more executors are ever added, even after 
> Executor A has timed out. It keeps requesting delta = 0 executors 
> endlessly.
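
The arithmetic behind step 3 of the description above can be illustrated 
with a small sketch. It is a toy model built only from that description, not 
the actual Spark implementation; ToyAllocationManager and its fields are 
hypothetical names, and the rule "send a request only when delta > 0" is an 
assumption taken from the report.

```python
class ToyAllocationManager:
    """Hypothetical stand-in for the allocation logic described in step 3."""

    def __init__(self, initial_target):
        self.num_executors_target = initial_target
        self.requests_sent = 0

    def sync(self, max_needed):
        """Recompute the target and request executors only if it grew.

        Returns the delta between the new and the old target.
        """
        old_target = self.num_executors_target
        self.num_executors_target = max_needed
        delta = self.num_executors_target - old_target
        if delta > 0:
            self.requests_sent += 1
        return delta


# Executor A is already counted, so the target starts at 1.
manager = ToyAllocationManager(initial_target=1)

# One pending task means one executor is needed, round after round.
deltas = [manager.sync(max_needed=1) for _ in range(5)]
```

Every round yields delta = 0, so requests_sent stays at 0 no matter how many 
rounds run, which matches the endless "request delta=0 executors" loop in 
the report.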



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

