[
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509
]
Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:27 AM:
-------------------------------------------------------------
The phenomenon is:
The BlockManagerSlave times out and BlockManagerMasterActor removes this
BlockManager, but the executor on this BlockManager does not time out because
Akka's heartbeat is still normal.
When dynamicAllocation is enabled, the allocationManager listens for
onBlockManagerRemoved and removes this executor from its own bookkeeping, but
in CoarseGrainedSchedulerBackend the executor is still in executorDataMap. At
this point the two views are inconsistent.
[~andrewor14] When BlockManagerMasterActor removes a BlockManager because the
BlockManager timed out, we need to check whether the executor on this
BlockManager has already been removed. If it has not, we should remove the
executor first. How about solving the problem this way?
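
To make the mismatch concrete, here is a rough toy model of the bookkeeping
described above. It is not Spark code: ToyDriver, its fields, and the `fixed`
flag are hypothetical names made up for illustration, and the real classes
(BlockManagerMasterActor, CoarseGrainedSchedulerBackend, the allocation
manager) are only represented as plain sets. It just shows how the two views
diverge when only the BlockManager is dropped, and how removing the executor
first would keep them in step.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDriver:
    """Hypothetical stand-in for driver-side bookkeeping; not Spark code."""

    # Executors the scheduler backend still believes are alive
    # (plays the role of executorDataMap).
    executor_data_map: set = field(default_factory=set)
    # Executors that currently have a registered BlockManager.
    block_managers: set = field(default_factory=set)
    # Executors the allocation manager is still tracking.
    allocation_tracked: set = field(default_factory=set)

    def register(self, executor_id: str) -> None:
        self.executor_data_map.add(executor_id)
        self.block_managers.add(executor_id)
        self.allocation_tracked.add(executor_id)

    def remove_executor(self, executor_id: str) -> None:
        # Full removal: scheduler view, BlockManager, and allocation view.
        self.executor_data_map.discard(executor_id)
        self.block_managers.discard(executor_id)
        self.allocation_tracked.discard(executor_id)

    def on_block_manager_timeout(self, executor_id: str, fixed: bool) -> None:
        if fixed and executor_id in self.executor_data_map:
            # Proposed behaviour: the executor is still registered,
            # so remove the executor first.
            self.remove_executor(executor_id)
            return
        # Behaviour described in the comment: only the BlockManager goes,
        # and the allocation manager reacts to the removal event.
        self.block_managers.discard(executor_id)
        self.allocation_tracked.discard(executor_id)

    def is_consistent(self) -> bool:
        # Both views should agree on which executors exist.
        return self.executor_data_map == self.allocation_tracked


def simulate(fixed: bool) -> bool:
    driver = ToyDriver()
    driver.register("1")
    driver.on_block_manager_timeout("1", fixed=fixed)
    return driver.is_consistent()


if __name__ == "__main__":
    print("without check:", simulate(fixed=False))
    print("with check:   ", simulate(fixed=True))
```

With `fixed=False` the executor stays in the scheduler's map while the
allocation view has already dropped it; with `fixed=True` both views end up
empty together.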
> Executor is still hold while BlockManager has been removed
> ----------------------------------------------------------
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Hong Shen
>
> When I run a Spark job, one executor hangs. After 120s its BlockManager is
> removed by the driver, but it takes about half an hour before the executor
> itself is removed by the driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms
> exceeds 120000ms
> ....
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote
> system [akka.tcp://[email protected]:46182] has failed, address is
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3,
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in
> removeExecutor