[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509 ]
Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:37 AM:
-------------------------------------------------------------

The phenomenon is: the BlockManagerSlave times out and BlockManagerMasterActor removes that BlockManager, but the executor hosting it does not time out because Akka's heartbeat is still normal. Because the BlockManager lives inside the executor, removing the BlockManager should remove its executor too. In particular, when dynamicAllocation is enabled, the allocationManager listens for onBlockManagerRemoved and removes this executor, but in CoarseGrainedSchedulerBackend the executor is actually still in executorDataMap.

[~andrewor14] When BlockManagerMasterActor removes a BlockManager because the BlockManager timed out, we need to check whether the executor on that BlockManager has already been removed. If it has not, we should remove the executor first. How about solving the problem this way?
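The check proposed in the comment above can be sketched roughly as follows. This is a simplified Python model, not Spark code: the classes `SchedulerBackend` and `BlockManagerMaster`, the method `expire_dead_hosts`, and the in-memory maps are all hypothetical stand-ins for CoarseGrainedSchedulerBackend's executorDataMap and BlockManagerMasterActor's heartbeat bookkeeping. Only the 120000 ms timeout and the 147198 ms gap are taken from the log quoted in the issue.

```python
# Hypothetical, simplified model of the proposal; none of these are Spark APIs.
from dataclasses import dataclass, field


@dataclass
class SchedulerBackend:
    """Stand-in for CoarseGrainedSchedulerBackend and its executorDataMap."""
    executor_data_map: dict = field(default_factory=dict)

    def remove_executor(self, executor_id: str, reason: str) -> None:
        # Drop the executor if it is still tracked; otherwise do nothing.
        if self.executor_data_map.pop(executor_id, None) is not None:
            print(f"removed executor {executor_id}: {reason}")


@dataclass
class BlockManagerMaster:
    """Stand-in for BlockManagerMasterActor's heartbeat bookkeeping."""
    backend: SchedulerBackend
    timeout_ms: int = 120_000  # value taken from the log in the issue
    last_seen_ms: dict = field(default_factory=dict)  # executor id -> last heartbeat

    def heartbeat(self, executor_id: str, now_ms: int) -> None:
        self.last_seen_ms[executor_id] = now_ms

    def expire_dead_hosts(self, now_ms: int) -> list:
        """Remove block managers whose heartbeat is older than timeout_ms."""
        expired = [e for e, seen in self.last_seen_ms.items()
                   if now_ms - seen > self.timeout_ms]
        for executor_id in expired:
            # Proposed order: if the backend still tracks the executor,
            # remove the executor first, then forget the block manager.
            if executor_id in self.backend.executor_data_map:
                self.backend.remove_executor(
                    executor_id, "BlockManager heartbeat timed out")
            del self.last_seen_ms[executor_id]
        return expired


backend = SchedulerBackend(executor_data_map={"1": "10.215.143.14"})
master = BlockManagerMaster(backend=backend)
master.heartbeat("1", now_ms=0)
expired = master.expire_dead_hosts(now_ms=147_198)  # 147198 ms > 120000 ms
```

In this model the executor and its block manager disappear in the same step, so the backend never holds an executor whose block manager is already gone. Whether the real fix belongs in BlockManagerMasterActor or in the scheduler backend is left open here.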
> Executor is still hold while BlockManager has been removed
> ----------------------------------------------------------
>
>                 Key: SPARK-5529
>                 URL: https://issues.apache.org/jira/browse/SPARK-5529
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Hong Shen
>
> When I run a Spark job, one executor hangs. After 120s its BlockManager is removed by the driver, but it takes about another half an hour before the executor itself is removed by the driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 120000ms
> ....
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor