[
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397271#comment-15397271
]
SuYan edited comment on SPARK-15815 at 7/28/16 8:50 AM:
--------------------------------------------------------
The current temporary workaround: once all executors have hit the 60s timeout,
we reset numExecutorsTarget=0 to break the stalled state. However, the reset
may take a long time to happen if there is a long-tail task.
A second temporary workaround: even when oldTargetNumExecutor equals
targetNumExecutor, still call client.requestTotalExecutors(numExecutorsTarget,
localityAwareTasks, hostToLocalTaskCount), so that as soon as the fully
blacklisted executor hits its 60s timeout, a replacement executor is requested.
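
The second workaround can be sketched as below. This is only a toy model, not
Spark code: FakeClusterClient, sync_target and the always_request flag are
hypothetical names, and the locality arguments of requestTotalExecutors are
left out. It assumes, as the comment implies, that the request is currently
skipped when the target is unchanged.

```python
class FakeClusterClient:
    """Hypothetical stand-in for the cluster manager client."""

    def __init__(self):
        self.requests = []  # every total that was actually sent

    def request_total_executors(self, total):
        self.requests.append(total)


def sync_target(client, old_target, new_target, always_request):
    """Send the target to the cluster manager; return True if a request went out."""
    # Assumed current behaviour: skip the call when the target did not change.
    # Workaround 2: send it anyway, so a timed-out executor can be replaced.
    if new_target != old_target or always_request:
        client.request_total_executors(new_target)
        return True
    return False


client = FakeClusterClient()
skipped = sync_target(client, old_target=1, new_target=1, always_request=False)
sent = sync_target(client, old_target=1, new_target=1, always_request=True)
```

With the target unchanged at 1, only the second call reaches the client.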
> Hang while enable blacklistExecutor and DynamicExecutorAllocator
> -----------------------------------------------------------------
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Affects Versions: 1.6.1
> Reporter: SuYan
> Priority: Minor
>
> Enable BlacklistExecutor with a blacklist time larger than 120s, and enable
> DynamicAllocate with minExecutors = 0.
> 1. Assume only 1 task is left, running on Executor A, and all other
> executors have timed out.
> 2. The task fails, so it will not be scheduled on Executor A again because
> of the blacklist time.
> 3. ExecutorAllocateManager always requests targetNumExecutor=1 executors.
> Because we already have Executor A, oldTargetNumExecutor ==
> targetNumExecutor = 1, so it will never add more executors, even if
> Executor A has timed out. It ends up endlessly requesting delta=0 executors.
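
The bookkeeping in step 3 can be illustrated with a small model. This is a
simplified sketch, not the actual allocation manager: AllocationModel and
update_target are hypothetical names, and it assumes one executor is wanted
per pending task with minExecutors = 0.

```python
from dataclasses import dataclass


@dataclass
class AllocationModel:
    """Toy model of the target-versus-delta bookkeeping from the report."""

    num_executors_target: int = 1  # Executor A is already counted

    def update_target(self, pending_tasks):
        """Recompute the target and return the delta that would be requested."""
        old_target = self.num_executors_target
        # Assumption: one executor per pending task, minExecutors = 0.
        self.num_executors_target = max(pending_tasks, 0)
        return self.num_executors_target - old_target


model = AllocationModel()
# The failed task is pending again, but Executor A is blacklisted for it.
deltas = [model.update_target(pending_tasks=1) for _ in range(5)]
print(deltas)  # [0, 0, 0, 0, 0]
```

Every round computes a delta of 0, so in this model no new executor is ever
requested while the only existing one cannot run the task.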