[ 
https://issues.apache.org/jira/browse/SPARK-37355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37355:
------------------------------------

    Assignee: Apache Spark

> Avoid Block Manager registrations when Executor is shutting down
> ----------------------------------------------------------------
>
>                 Key: SPARK-37355
>                 URL: https://issues.apache.org/jira/browse/SPARK-37355
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Wan Kun
>            Assignee: Apache Spark
>            Priority: Minor
>
> *Note:* Similar to SPARK-34949 and SPARK-35011, the BlockManager re-registers 
> itself while the executor is shutting down.
> *Problem:*
> As described in SPARK-35011, HeartbeatReceiver.expireDeadHosts() will not 
> clean up those BlockManagers if the executor is killed with the reason 
> "Executor heartbeat timed out". An executor's heartbeat can time out because 
> of a network issue or for some other reason, such as SPARK-20977.
> *Logs:*
> {code:java}
> // Driver Logs
> 21/11/13 05:06:20,999 WARN [dispatcher-event-loop-36] 
> spark.HeartbeatReceiver:69 : Removing executor 3056 with no recent 
> heartbeats: 350149 ms exceeds timeout 300000 ms
> 21/11/13 05:06:20,999 INFO [kill-executor-thread] 
> cluster.YarnClientSchedulerBackend:57 : Requesting to kill executor(s) 3056
> 21/11/13 05:06:21,000 INFO [kill-executor-thread] 
> cluster.YarnClientSchedulerBackend:57 : Actual list of executor(s) to be 
> killed is 3056
> 21/11/13 05:06:21,000 INFO [dispatcher-event-loop-8] 
> yarn.ApplicationMaster$AMEndpoint:57 : Driver requested to kill executor(s) 
> 3056.
> 21/11/13 05:06:21,000 INFO [dispatcher-CoarseGrainedScheduler] 
> cluster.YarnSchedulerBackend$YarnDriverEndpoint:57 : Asked to remove executor 
> 3056 with reason Executor heartbeat timed out after 350149 ms
> 21/11/13 05:06:21,000 ERROR [dispatcher-CoarseGrainedScheduler] 
> cluster.YarnScheduler:73 : Lost executor 3056 on executor_host: Executor 
> heartbeat timed out after 350149 ms
> 21/11/13 05:06:21,041 WARN [dispatcher-CoarseGrainedScheduler] 
> scheduler.TaskSetManager:69 : Lost task 262191.0 in stage 452627.0 (TID 
> 245764597, executor_host, executor 3056): ExecutorLostFailure (executor 3056 
> exited caused by one of the running tasks) Reason: Executor heartbeat timed 
> out after 350149 ms
> 21/11/13 05:06:21,067 WARN [dispatcher-CoarseGrainedScheduler] 
> scheduler.TaskSetManager:69 : Lost task 259149.0 in stage 452627.0 (TID 
> 245761130, executor_host, executor 3056): ExecutorLostFailure (executor 3056 
> exited caused by one of the running tasks) Reason: Executor heartbeat timed 
> out after 350149 ms
> 21/11/13 05:06:22,068 INFO [dispatcher-BlockManagerMaster] 
> storage.BlockManagerMasterEndpoint:57 : Trying to remove executor 3056 from 
> BlockManagerMaster.
> 21/11/13 05:06:22,072 INFO [dispatcher-BlockManagerMaster] 
> storage.BlockManagerMasterEndpoint:57 : Removing block manager 
> BlockManagerId(3056, executor_host, 30504, None)
> 21/11/13 05:06:22,073 INFO [dag-scheduler-event-loop] 
> storage.BlockManagerMaster:57 : Removed 3056 successfully in removeExecutor
> 21/11/13 05:06:22,962 INFO [dispatcher-BlockManagerMaster] 
> storage.BlockManagerMasterEndpoint:57 : Registering block manager 
> executor_host:30504 with 88.5 GiB RAM, BlockManagerId(3056, executor_host, 
> 30504, None)
> // Executor Logs
> 21/11/13 05:06:21,004 INFO [dispatcher-Executor] 
> executor.YarnCoarseGrainedExecutorBackend:57 : Driver commanded a shutdown
> 21/11/13 05:06:22,215 INFO [block-manager-future-0] 
> storage.BlockManagerMaster:57 : Registering BlockManager BlockManagerId(3056, 
> executor_host, 30504, None)
> {code}
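
The log sequence above shows the executor receiving "Driver commanded a shutdown" at 05:06:21,004 and then sending a new BlockManager registration at 05:06:22,215. A minimal sketch of the kind of guard the issue title suggests is given below. It is not Spark's code: the class names `FakeMaster` and `GuardedBlockManager` and the methods `markShuttingDown()` and `reregister()` are hypothetical stand-ins, and the sketch assumes a single shutdown flag that is set before any later registration attempt is made.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownAwareRegistration {

    // Hypothetical stand-in for the driver-side registry; not a Spark class.
    static final class FakeMaster {
        final AtomicInteger registrations = new AtomicInteger();

        void register(String executorId) {
            registrations.incrementAndGet();
        }
    }

    // Hypothetical stand-in for the executor-side block manager.
    static final class GuardedBlockManager {
        private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
        private final FakeMaster master;
        private final String executorId;

        GuardedBlockManager(FakeMaster master, String executorId) {
            this.master = master;
            this.executorId = executorId;
        }

        // Assumed to be called when the shutdown command arrives.
        void markShuttingDown() {
            shuttingDown.set(true);
        }

        // Registers with the master unless shutdown has already been flagged.
        // Returns true if a registration was actually sent.
        boolean reregister() {
            if (shuttingDown.get()) {
                return false; // skip: the executor is going away
            }
            master.register(executorId);
            return true;
        }
    }

    public static void main(String[] args) {
        FakeMaster master = new FakeMaster();
        GuardedBlockManager bm = new GuardedBlockManager(master, "3056");

        System.out.println("registered before shutdown: " + bm.reregister());
        bm.markShuttingDown();
        System.out.println("registered after shutdown: " + bm.reregister());
        System.out.println("total registrations: " + master.registrations.get());
    }
}
```

The flag check and the registration are two separate steps here, so a shutdown that arrives between them would still let one registration through. A real fix would have to decide how to handle that window.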



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
