[
https://issues.apache.org/jira/browse/SPARK-37355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-37355:
------------------------------------
Assignee: Apache Spark
> Avoid Block Manager registrations when Executor is shutting down
> ----------------------------------------------------------------
>
> Key: SPARK-37355
> URL: https://issues.apache.org/jira/browse/SPARK-37355
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: Wan Kun
> Assignee: Apache Spark
> Priority: Minor
>
> *Note:* Similar to SPARK-34949 and SPARK-35011, the BlockManager re-registers
> itself while the executor is shutting down.
> *Problem:*
> As described in SPARK-35011, HeartbeatReceiver.expireDeadHosts() will not
> clean up those BlockManagers if the executor is killed with the reason
> "Executor heartbeat timed out". An executor's heartbeat can time out because
> of network issues or for other reasons, such as SPARK-20977.
> *Logs:*
> {code:java}
> // Driver Logs
> 21/11/13 05:06:20,999 WARN [dispatcher-event-loop-36]
> spark.HeartbeatReceiver:69 : Removing executor 3056 with no recent
> heartbeats: 350149 ms exceeds timeout 300000 ms
> 21/11/13 05:06:20,999 INFO [kill-executor-thread]
> cluster.YarnClientSchedulerBackend:57 : Requesting to kill executor(s) 3056
> 21/11/13 05:06:21,000 INFO [kill-executor-thread]
> cluster.YarnClientSchedulerBackend:57 : Actual list of executor(s) to be
> killed is 3056
> 21/11/13 05:06:21,000 INFO [dispatcher-event-loop-8]
> yarn.ApplicationMaster$AMEndpoint:57 : Driver requested to kill executor(s)
> 3056.
> 21/11/13 05:06:21,000 INFO [dispatcher-CoarseGrainedScheduler]
> cluster.YarnSchedulerBackend$YarnDriverEndpoint:57 : Asked to remove executor
> 3056 with reason Executor heartbeat timed out after 350149 ms
> 21/11/13 05:06:21,000 ERROR [dispatcher-CoarseGrainedScheduler]
> cluster.YarnScheduler:73 : Lost executor 3056 on executor_host: Executor
> heartbeat timed out after 350149 ms
> 21/11/13 05:06:21,041 WARN [dispatcher-CoarseGrainedScheduler]
> scheduler.TaskSetManager:69 : Lost task 262191.0 in stage 452627.0 (TID
> 245764597, executor_host, executor 3056): ExecutorLostFailure (executor 3056
> exited caused by one of the running tasks) Reason: Executor heartbeat timed
> out after 350149 ms
> 21/11/13 05:06:21,067 WARN [dispatcher-CoarseGrainedScheduler]
> scheduler.TaskSetManager:69 : Lost task 259149.0 in stage 452627.0 (TID
> 245761130, executor_host, executor 3056): ExecutorLostFailure (executor 3056
> exited caused by one of the running tasks) Reason: Executor heartbeat timed
> out after 350149 ms
> 21/11/13 05:06:22,068 INFO [dispatcher-BlockManagerMaster]
> storage.BlockManagerMasterEndpoint:57 : Trying to remove executor 3056 from
> BlockManagerMaster.
> 21/11/13 05:06:22,072 INFO [dispatcher-BlockManagerMaster]
> storage.BlockManagerMasterEndpoint:57 : Removing block manager
> BlockManagerId(3056, executor_host, 30504, None)
> 21/11/13 05:06:22,073 INFO [dag-scheduler-event-loop]
> storage.BlockManagerMaster:57 : Removed 3056 successfully in removeExecutor
> 21/11/13 05:06:22,962 INFO [dispatcher-BlockManagerMaster]
> storage.BlockManagerMasterEndpoint:57 : Registering block manager
> executor_host:30504 with 88.5 GiB RAM, BlockManagerId(3056, executor_host,
> 30504, None)
> // Executor Logs
> 21/11/13 05:06:21,004 INFO [dispatcher-Executor]
> executor.YarnCoarseGrainedExecutorBackend:57 : Driver commanded a shutdown
> 21/11/13 05:06:22,215 INFO [block-manager-future-0]
> storage.BlockManagerMaster:57 : Registering BlockManager BlockManagerId(3056,
> executor_host, 30504, None)
> {code}
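
The logs above show the driver removing the block manager for executor 3056 at 05:06:22,072 and then registering it again at 05:06:22,962, after the executor had already logged "Driver commanded a shutdown". One way to picture the guard that the issue title asks for is the minimal Java sketch below. It rests on assumptions: ShutdownAwareRegistrar and all of its methods are hypothetical names invented for illustration, the in-memory set only stands in for the driver-side registry, and none of this is Spark's BlockManager API or necessarily how the actual patch works.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only: not Spark's BlockManager or BlockManagerMaster.
class ShutdownAwareRegistrar {
    // Set once the executor has been told to shut down (assumed flag).
    private boolean shuttingDown = false;
    // Stand-in for the driver-side registry of block managers (assumption).
    private final Set<String> registered = new HashSet<>();

    // Called when the shutdown command arrives on the executor.
    synchronized void markShuttingDown() {
        shuttingDown = true;
    }

    // Simulates the driver removing the block manager of a lost executor.
    synchronized void removeOnDriver(String executorId) {
        registered.remove(executorId);
    }

    // Registers only if no shutdown is in progress; returns whether it registered.
    synchronized boolean tryRegister(String executorId) {
        if (shuttingDown) {
            return false; // skip the late registration seen in the logs
        }
        registered.add(executorId);
        return true;
    }

    synchronized boolean isRegistered(String executorId) {
        return registered.contains(executorId);
    }

    public static void main(String[] args) {
        ShutdownAwareRegistrar registrar = new ShutdownAwareRegistrar();
        registrar.tryRegister("3056");      // normal registration at startup
        registrar.markShuttingDown();       // shutdown command received
        registrar.removeOnDriver("3056");   // driver removes the block manager
        boolean late = registrar.tryRegister("3056"); // late attempt is refused
        System.out.println("late registration accepted: " + late);
        System.out.println("still registered: " + registrar.isRegistered("3056"));
    }
}
```

The methods are synchronized so that the shutdown check and the registration happen as one step. Without that, a registration could slip in between the check and the update, which is the same kind of race the logs illustrate.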