Wan Kun created SPARK-37355:
-------------------------------
Summary: Avoid Block Manager registrations when Executor is
shutting down
Key: SPARK-37355
URL: https://issues.apache.org/jira/browse/SPARK-37355
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.2.0
Reporter: Wan Kun
*Note:* Similar to SPARK-34949 and SPARK-35011, the BlockManager re-registers
itself while the executor is shutting down.
*Problem:*
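As described in SPARK-35011, HeartbeatReceiver.expireDeadHosts() will not clean
up those BlockManagers if the executor is killed with the reason "Executor
heartbeat timed out". An executor's heartbeat can time out because of a network
issue, or for some other reason such as SPARK-20977.
In the logs below, the executor receives the shutdown command at 05:06:21,004,
the driver removes its block manager at 05:06:22,072, and the executor then
registers the same BlockManagerId again (05:06:22,215 on the executor,
05:06:22,962 on the driver).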
*Logs:*
{code:java}
// Driver Logs
21/11/13 05:06:20,999 WARN [dispatcher-event-loop-36]
spark.HeartbeatReceiver:69 : Removing executor 3056 with no recent heartbeats:
350149 ms exceeds timeout 300000 ms
21/11/13 05:06:20,999 INFO [kill-executor-thread]
cluster.YarnClientSchedulerBackend:57 : Requesting to kill executor(s) 3056
21/11/13 05:06:21,000 INFO [kill-executor-thread]
cluster.YarnClientSchedulerBackend:57 : Actual list of executor(s) to be killed
is 3056
21/11/13 05:06:21,000 INFO [dispatcher-event-loop-8]
yarn.ApplicationMaster$AMEndpoint:57 : Driver requested to kill executor(s)
3056.
21/11/13 05:06:21,000 INFO [dispatcher-CoarseGrainedScheduler]
cluster.YarnSchedulerBackend$YarnDriverEndpoint:57 : Asked to remove executor
3056 with reason Executor heartbeat timed out after 350149 ms
21/11/13 05:06:21,000 ERROR [dispatcher-CoarseGrainedScheduler]
cluster.YarnScheduler:73 : Lost executor 3056 on executor_host: Executor
heartbeat timed out after 350149 ms
21/11/13 05:06:21,041 WARN [dispatcher-CoarseGrainedScheduler]
scheduler.TaskSetManager:69 : Lost task 262191.0 in stage 452627.0 (TID
245764597, executor_host, executor 3056): ExecutorLostFailure (executor 3056
exited caused by one of the running tasks) Reason: Executor heartbeat timed out
after 350149 ms
21/11/13 05:06:21,067 WARN [dispatcher-CoarseGrainedScheduler]
scheduler.TaskSetManager:69 : Lost task 259149.0 in stage 452627.0 (TID
245761130, executor_host, executor 3056): ExecutorLostFailure (executor 3056
exited caused by one of the running tasks) Reason: Executor heartbeat timed out
after 350149 ms
21/11/13 05:06:22,068 INFO [dispatcher-BlockManagerMaster]
storage.BlockManagerMasterEndpoint:57 : Trying to remove executor 3056 from
BlockManagerMaster.
21/11/13 05:06:22,072 INFO [dispatcher-BlockManagerMaster]
storage.BlockManagerMasterEndpoint:57 : Removing block manager
BlockManagerId(3056, executor_host, 30504, None)
21/11/13 05:06:22,073 INFO [dag-scheduler-event-loop]
storage.BlockManagerMaster:57 : Removed 3056 successfully in removeExecutor
21/11/13 05:06:22,962 INFO [dispatcher-BlockManagerMaster]
storage.BlockManagerMasterEndpoint:57 : Registering block manager
executor_host:30504 with 88.5 GiB RAM, BlockManagerId(3056, executor_host,
30504, None)
// Executor Logs
21/11/13 05:06:21,004 INFO [dispatcher-Executor]
executor.YarnCoarseGrainedExecutorBackend:57 : Driver commanded a shutdown
21/11/13 05:06:22,215 INFO [block-manager-future-0]
storage.BlockManagerMaster:57 : Registering BlockManager BlockManagerId(3056,
executor_host, 30504, None)
{code}
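The sequence in the logs suggests that registration should be refused once the
executor has been told to shut down. The following is only a minimal sketch of
such a guard, not Spark's actual code or the fix for this issue: the class
ReregistrationGuard and its methods markShuttingDown() and
reregisterIfRunning() are hypothetical names introduced for illustration, and
the Runnable stands in for whatever call sends the registration to the driver.

```java
// Hypothetical sketch: none of these names exist in Spark.
final class ReregistrationGuard {
    // Guarded by "this" so that the check and the registration happen
    // atomically with respect to markShuttingDown().
    private boolean shuttingDown = false;

    // Called when the executor receives the shutdown command from the driver.
    synchronized void markShuttingDown() {
        shuttingDown = true;
    }

    // Runs the registration action only if shutdown has not started.
    // Returns true if the action ran, false if it was skipped.
    synchronized boolean reregisterIfRunning(Runnable registerAction) {
        if (shuttingDown) {
            return false; // skip: the driver has already asked us to stop
        }
        registerAction.run();
        return true;
    }

    public static void main(String[] args) {
        ReregistrationGuard guard = new ReregistrationGuard();
        // Before shutdown: the registration action runs.
        System.out.println(guard.reregisterIfRunning(
            () -> System.out.println("registering block manager")));
        guard.markShuttingDown();
        // After shutdown: the registration action is skipped.
        System.out.println(guard.reregisterIfRunning(
            () -> System.out.println("registering block manager")));
    }
}
```

Holding one lock for both the check and the action avoids a window in which
shutdown starts after the check but before the registration is sent. It does
not help if the registration is already in flight when shutdown is requested,
so the driver side may still need to reject registrations from executors it has
removed.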