[ 
https://issues.apache.org/jira/browse/SPARK-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-12419.
----------------------------------
    Resolution: Incomplete

> FetchFailed = false: should a lost executor be allowed to re-register with 
> BlockManagerMaster again?
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12419
>                 URL: https://issues.apache.org/jira/browse/SPARK-12419
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: SuYan
>            Priority: Minor
>              Labels: bulk-closed
>
> On YARN, I found a container that was reported as completed by 
> YarnAllocator (the container was killed proactively by YARN due to a disk 
> error) and was therefore removed from BlockManagerMaster.
> But about one second later, because YARN had not actually killed the 
> process yet, the executor re-registered with BlockManagerMaster. That 
> looks unreasonable.
> I checked the code:
> fetchFailed=true: it is reasonable to allow the executor to re-register.
> fetchFailed=false: this path covers heartbeat expiry (which calls 
> sc.killExecutor), CoarseGrainedSchedulerBackend.RemoveExecutor() (which 
> assumes the executor will never come back), and the Mesos executor-lost 
> event (I am not familiar with whether Mesos lets the executor come back). 
> If every fetchFailed=false executor loss were treated as "will not come 
> back", we could prevent such executors from re-registering with 
> BlockManagerMaster.
> Also, this may suggest a YARN-side improvement: containers reported in 
> completedContainers should be truly dead before they are reported.
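> A minimal sketch of the guard suggested above (the class and names such as 
> lostForever are illustrative, not Spark's actual internals): executors 
> removed with fetchFailed=false are remembered as permanently lost, and any 
> later re-registration from them is rejected:
>
>     import scala.collection.mutable
>
>     // Hypothetical registry mirroring BlockManagerMasterEndpoint's role.
>     class BlockManagerRegistry {
>       private val registered  = mutable.Set[String]()
>       private val lostForever = mutable.Set[String]()
>
>       def removeExecutor(execId: String, fetchFailed: Boolean): Unit = {
>         registered -= execId
>         if (!fetchFailed) {
>           // Heartbeat expiry / RemoveExecutor(): assumed gone for good.
>           lostForever += execId
>         }
>       }
>
>       // Returns false when a "dead" executor tries to come back, which is
>       // exactly the late re-registration seen in the logs below.
>       def register(execId: String): Boolean = {
>         if (lostForever.contains(execId)) false
>         else { registered += execId; true }
>       }
>     }
>
> With such a guard, register("84") would return false after 
> removeExecutor("84", fetchFailed = false), instead of resurrecting the 
> block manager one second after its removal.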
> Here the logs:
> 2015-12-14,10:25:00,647 INFO org.apache.spark.deploy.yarn.YarnAllocator: 
> Completed container container_1435709042873_31294_01_208639 (state: COMPLETE, 
> exit status: -100)
> 2015-12-14,10:25:00,647 INFO org.apache.spark.deploy.yarn.YarnAllocator: 
> Container marked as failed: container_1435709042873_31294_01_208639. Exit 
> status: -100. Diagnostics: Container released on a *lost* node
> 2015-12-14,10:25:00,667 ERROR 
> org.apache.spark.scheduler.cluster.YarnClusterScheduler: Lost executor 84 on 
> XX.XX.XX.109.bj: Yarn deallocated the executor 84 (container 
> container_1435709042873_31294_01_208639)
> 2015-12-14,10:25:00,667 INFO org.apache.spark.scheduler.TaskSetManager: 
> Re-queueing tasks for 84 from TaskSet 5.0
> 2015-12-14,10:25:00,670 INFO org.apache.spark.scheduler.ShuffleMapStage: 
> ShuffleMapStage 5 is now unavailable on executor 21 (1926/2600, false)
> 2015-12-14,10:25:00,674 INFO org.apache.spark.scheduler.DAGScheduler: 
> Resubmitted ShuffleMapTask(5, 504), so marking it as still running
> 2015-12-14,10:25:00,675 INFO org.apache.spark.scheduler.DAGScheduler: 
> Resubmitted ShuffleMapTask(5, 773), so marking it as still running
> 2015-12-14,10:25:00,676 INFO org.apache.spark.scheduler.DAGScheduler: 
> Executor lost: 84 (epoch 13)
> 2015-12-14,10:25:00,676 INFO 
> org.apache.spark.storage.BlockManagerMasterEndpoint: Trying to remove 
> executor 84 from BlockManagerMaster.
> 2015-12-14,10:25:00,677 INFO 
> org.apache.spark.storage.BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(84, XX.XX.XX.109.bj, 44528)
> 2015-12-14,10:25:00,677 INFO org.apache.spark.storage.BlockManagerMaster: 
> Removed 84 successfully in removeExecutor
> 2015-12-14,10:25:01,066 INFO 
> org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block 
> manager XX.XX.XX.109.bj:44528 with 706.7 MB RAM, BlockManagerId(84, 
> XX.XX.XX.109.bj, 44528)
> 2015-12-14,10:25:01,584 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> rdd_20_2278 in memory on XX.XX.XX.109.bj:44528 (size:



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
