Lukas Waldmann created SPARK-27228:
--------------------------------------
Summary: Spark long delay on close, possible problem with killing
executors
Key: SPARK-27228
URL: https://issues.apache.org/jira/browse/SPARK-27228
Project: Spark
Issue Type: Bug
Components: Block Manager
Affects Versions: 2.3.0
Reporter: Lukas Waldmann
When using dynamic allocations after all jobs finishes spark delays for several
minutes before finally finishes. Log suggest that executors are not cleared up
properly.
{quote}19/03/21 09:51:38 INFO SparkSession: PROCESSING FINISHED 19/03/21
09:51:38 INFO ExecutorAllocationManager: Request to remove executorIds: 355
19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Requesting to kill
executor(s) 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Actual list
of executor(s) to be killed is 355 19/03/21 09:51:38 INFO
ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 355.
19/03/21 09:51:38 INFO ExecutorAllocationManager: Removing executor 355 because
it has been idle for 60 seconds (new desired total will be 65) 19/03/21
09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 228.
19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 228 (epoch 446) 19/03/21
09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 228 from
BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(228, data-15.bdp.gin.merck.com, 45882, None)
19/03/21 09:51:38 INFO BlockManagerMaster: Removed 228 successfully in
removeExecutor 19/03/21 09:51:38 INFO SparkUI: Stopped Spark web UI at
http://data-04.bdp.gin.merck.com:44304 19/03/21 09:51:38 INFO
YarnClusterScheduler: Executor 228 on data-15.bdp.gin.merck.com killed by
driver. 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint:
Disabling executor 346. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 346
(epoch 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove
executor 346 from BlockManagerMaster. 19/03/21 09:51:38 INFO
BlockManagerMasterEndpoint: Removing block manager BlockManagerId(346,
datanode-02.bdp.gin.merck.com, 41186, None) 19/03/21 09:51:38 INFO
BlockManagerMaster: Removed 346 successfully in removeExecutor 19/03/21
09:51:38 INFO YarnClusterScheduler: Executor 346 on
datanode-02.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 332. 19/03/21
09:51:39 INFO DAGScheduler: Executor lost: 332 (epoch 446) 19/03/21 09:51:39
INFO BlockManagerMasterEndpoint: Trying to remove executor 332 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(332, data-10.bdp.gin.merck.com, 38713, None)
19/03/21 09:51:39 INFO BlockManagerMaster: Removed 332 successfully in
removeExecutor 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 332 on
data-10.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 240. 19/03/21
09:51:39 INFO YarnClusterScheduler: Executor 240 on data-22.bdp.gin.merck.com
killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 240
(epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove
executor 240 from BlockManagerMaster. 19/03/21 09:51:39 INFO
BlockManagerMasterEndpoint: Removing block manager BlockManagerId(240,
data-22.bdp.gin.merck.com, 43344, None) 19/03/21 09:51:39 INFO
BlockManagerMaster: Removed 240 successfully in removeExecutor 19/03/21
09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 327.
19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 327 (epoch 446) 19/03/21
09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 327 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(327, data-20.bdp.gin.merck.com, 34235, None)
19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 327 on
data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
BlockManagerMaster: Removed 327 successfully in removeExecutor 19/03/21
09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 355.
19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 355 on
data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
DAGScheduler: Executor lost: 355 (epoch 446) 19/03/21 09:51:39 INFO
BlockManagerMasterEndpoint: Trying to remove executor 355 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(355, data-20.bdp.gin.merck.com, 43141, None)
19/03/21 09:51:39 INFO BlockManagerMaster: Removed 355 successfully in
removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint:
Disabling executor 168. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 168
(epoch 446) 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 168 on
data-07.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
BlockManagerMasterEndpoint: Trying to remove executor 168 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(168, data-07.bdp.gin.merck.com, 44833, None)
19/03/21 09:51:39 INFO BlockManagerMaster: Removed 168 successfully in
removeExecutor 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 332
with no recent heartbeats: 173942 ms exceeds timeout 120000 ms 19/03/21
09:54:26 ERROR YarnClusterScheduler: Lost an executor 332 (already removed):
Executor heartbeat timed out after 173942 ms 19/03/21 09:54:26 WARN
HeartbeatReceiver: Removing executor 346 with no recent heartbeats: 172853 ms
exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an
executor 346 (already removed): Executor heartbeat timed out after 172853 ms
19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 355 with no recent
heartbeats: 173169 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
YarnClusterScheduler: Lost an executor 355 (already removed): Executor
heartbeat timed out after 173169 ms 19/03/21 09:54:26 WARN HeartbeatReceiver:
Removing executor 168 with no recent heartbeats: 174129 ms exceeds timeout
120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 168
(already removed): Executor heartbeat timed out after 174129 ms 19/03/21
09:54:26 WARN HeartbeatReceiver: Removing executor 327 with no recent
heartbeats: 169555 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
YarnClusterScheduler: Lost an executor 327 (already removed): Executor
heartbeat timed out after 169555 ms 19/03/21 09:54:26 WARN HeartbeatReceiver:
Removing executor 240 with no recent heartbeats: 177937 ms exceeds timeout
120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 240
(already removed): Executor heartbeat timed out after 177937 ms 19/03/21
09:54:26 WARN HeartbeatReceiver: Removing executor 228 with no recent
heartbeats: 178171 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
YarnClusterScheduler: Lost an executor 228 (already removed): Executor
heartbeat timed out after 178171 ms 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Requesting to kill executor(s) 332 19/03/21
09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 332 does not exist!
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s)
to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting
to kill executor(s) 346 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend:
Executor to kill 346 does not exist! 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill
executor(s) 355 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to
kill 355 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend:
Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Requesting to kill executor(s) 168 19/03/21
09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 168 does not exist!
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s)
to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting
to kill executor(s) 327 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend:
Executor to kill 327 does not exist! 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill
executor(s) 240 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to
kill 240 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend:
Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Requesting to kill executor(s) 228 19/03/21
09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 228 does not exist!
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s)
to be killed is 19/03/21 09:56:43 INFO YarnClusterSchedulerBackend: Shutting
down all executors 19/03/21 09:56:43 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
19/03/21 09:56:43 INFO SchedulerExtensionServices: Stopping
SchedulerExtensionServices (serviceOption=None, services=List(), started=false)
19/03/21 09:56:43 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped! 19/03/21 09:56:43 INFO MemoryStore:
MemoryStore cleared 19/03/21 09:56:43 INFO BlockManager: BlockManager stopped
19/03/21 09:56:43 INFO BlockManagerMaster: BlockManagerMaster stopped 19/03/21
09:56:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped! 19/03/21 09:56:43 INFO SparkContext:
Successfully stopped SparkContext 19/03/21 09:56:43 INFO ApplicationMaster:
Final app status: SUCCEEDED, exitCode: 0 19/03/21 09:56:43 INFO
ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 19/03/21
09:56:43 INFO AMRMClientImpl: Waiting for application to be successfully
unregistered. 19/03/21 09:56:43 INFO ApplicationMaster: Deleting staging
directory
hdfs://bdpprodnn/user/waldmann/.sparkStaging/application_1552944783492_60353
19/03/21 09:56:44 INFO ShutdownHookManager: Shutdown hook called 19/03/21
09:56:44 INFO ShutdownHookManager: Deleting directory
/mnt/data01/hadoop/yarn/local/usercache/waldmann/appcache/application_1552944783492_60353/spark-713791f3-87ca-4234-96d5-9969850fcea9{quote}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]