[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509
 ] 

Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:37 AM:
-------------------------------------------------------------

The phenomenon is:
The BlockManagerSlave times out and BlockManagerMasterActor removes this 
BlockManager, but the executor hosting this BlockManager does not time out 
because Akka's heartbeat is still normal. Because the BlockManager lives 
inside the executor, if the BlockManager is removed, its executor should be 
removed too.
This matters especially when dynamicAllocation is enabled: the 
allocationManager listens for onBlockManagerRemoved and removes this 
executor, but in CoarseGrainedSchedulerBackend the executor is actually still 
in executorDataMap.
[~andrewor14] When BlockManagerMasterActor removes a BlockManager because the 
BlockManager timed out, we need to check whether the executor on this 
BlockManager has already been removed. If the executor has not been removed, 
we should first remove the executor. How about solving the problem this way?
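
To make the proposal concrete, here is a minimal toy model in Python. It is 
not Spark code: the classes SchedulerBackend and BlockManagerMaster, the 
method expire_dead_hosts, and all field names other than executorDataMap's 
counterpart are hypothetical stand-ins, and the 120000 ms timeout is simply 
the value shown in the reporter's log. The sketch only illustrates the order 
of operations: on a BlockManager timeout, remove the executor from the 
scheduler backend first, then drop the BlockManager.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerBackend:
    """Hypothetical, simplified stand-in for CoarseGrainedSchedulerBackend."""
    executor_data_map: dict = field(default_factory=dict)

    def remove_executor(self, executor_id: str, reason: str) -> bool:
        # Returns True only if the executor was still registered.
        return self.executor_data_map.pop(executor_id, None) is not None


@dataclass
class BlockManagerMaster:
    """Hypothetical, simplified stand-in for BlockManagerMasterActor."""
    backend: SchedulerBackend
    timeout_ms: int = 120_000  # value taken from the log in this issue
    last_seen_ms: dict = field(default_factory=dict)  # executor id -> last heartbeat

    def expire_dead_hosts(self, now_ms: int) -> list:
        removed = []
        # Iterate over a copy because entries are deleted inside the loop.
        for executor_id, last_seen in list(self.last_seen_ms.items()):
            if now_ms - last_seen > self.timeout_ms:
                # Proposed step: if the executor is still registered with the
                # scheduler backend, remove it before dropping its BlockManager.
                self.backend.remove_executor(
                    executor_id, "BlockManager heartbeat timeout")
                del self.last_seen_ms[executor_id]
                removed.append(executor_id)
        return removed


# Usage: executor "1" has been silent for 147198 ms, executor "2" has not.
backend = SchedulerBackend({"1": "10.215.143.14", "2": "10.215.143.15"})
master = BlockManagerMaster(backend, last_seen_ms={"1": 0, "2": 100_000})
removed = master.expire_dead_hosts(now_ms=147_198)
```

After the call, executor "1" is gone from both the BlockManager bookkeeping 
and the backend's executor map, so the two views stay consistent.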





> Executor is still hold while BlockManager has been removed
> ----------------------------------------------------------
>
>                 Key: SPARK-5529
>                 URL: https://issues.apache.org/jira/browse/SPARK-5529
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Hong Shen
>
> When I run a Spark job, one executor gets stuck. After 120s its 
> BlockManager is removed by the driver, but it takes about half an hour 
> before the executor itself is removed by the driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 120000ms
> ....
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
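
As a quick check of the timeline in the quoted log, the gap between the 
BlockManager removal and the executor loss can be computed from the two 
timestamps. This is only a small sketch; the format string is an assumption 
based on how the timestamps appear in the log above.

```python
from datetime import datetime

# Assumed layout of the log timestamps: yy/MM/dd HH:mm:ss
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"

block_manager_removed = datetime.strptime("15/02/02 14:58:43", LOG_TIME_FORMAT)
executor_lost = datetime.strptime("15/02/02 15:26:55", LOG_TIME_FORMAT)

gap_seconds = int((executor_lost - block_manager_removed).total_seconds())
print(gap_seconds)  # 1692, i.e. 28 min 12 s
```

So the executor stayed registered for roughly 28 minutes after its 
BlockManager had been removed, which matches the "half an hour" in the report.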



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
