[
https://issues.apache.org/jira/browse/SPARK-27228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lukas Waldmann updated SPARK-27228:
-----------------------------------
Description:
When using dynamic allocations after all jobs finishes spark delays for several
minutes before finally finishes. Log suggest that executors are not cleared up
properly.
{quote}{{19/03/21 09:51:38 INFO SparkSession: PROCESSING FINISHED 19/03/21
09:51:38 INFO ExecutorAllocationManager: Request to remove executorIds: 355
19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Requesting to kill
executor(s) 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Actual list
of executor(s) to be killed is 355 19/03/21 09:51:38 INFO
ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 355.
19/03/21 09:51:38 INFO ExecutorAllocationManager: Removing executor 355 because
it has been idle for 60 seconds (new desired total will be 65) 19/03/21
09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 228.
19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 228 (epoch 446) 19/03/21
09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 228 from
BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(228, data-15.bdp.gin.merck.com, 45882, None)
19/03/21 09:51:38 INFO BlockManagerMaster: Removed 228 successfully in
removeExecutor 19/03/21 09:51:38 INFO SparkUI: Stopped Spark web UI at
[http://data-04.bdp.gin.merck.com:44304|http://data-04.bdp.gin.merck.com:44304/]
19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 228 on
data-15.bdp.gin.merck.com killed by driver. 19/03/21 09:51:38 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 346. 19/03/21
09:51:38 INFO DAGScheduler: Executor lost: 346 (epoch 446) 19/03/21 09:51:38
INFO BlockManagerMasterEndpoint: Trying to remove executor 346 from
BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(346, datanode-02.bdp.gin.merck.com, 41186, None)
19/03/21 09:51:38 INFO BlockManagerMaster: Removed 346 successfully in
removeExecutor 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 346 on
datanode-02.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 332. 19/03/21
09:51:39 INFO DAGScheduler: Executor lost: 332 (epoch 446) 19/03/21 09:51:39
INFO BlockManagerMasterEndpoint: Trying to remove executor 332 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(332, data-10.bdp.gin.merck.com, 38713, None)
19/03/21 09:51:39 INFO BlockManagerMaster: Removed 332 successfully in
removeExecutor 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 332 on
data-10.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 240. 19/03/21
09:51:39 INFO YarnClusterScheduler: Executor 240 on data-22.bdp.gin.merck.com
killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 240
(epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove
executor 240 from BlockManagerMaster. 19/03/21 09:51:39 INFO
BlockManagerMasterEndpoint: Removing block manager BlockManagerId(240,
data-22.bdp.gin.merck.com, 43344, None) 19/03/21 09:51:39 INFO
BlockManagerMaster: Removed 240 successfully in removeExecutor 19/03/21
09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 327.
19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 327 (epoch 446) 19/03/21
09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 327 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(327, data-20.bdp.gin.merck.com, 34235, None)
19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 327 on
data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
BlockManagerMaster: Removed 327 successfully in removeExecutor 19/03/21
09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 355.
19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 355 on
data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
DAGScheduler: Executor lost: 355 (epoch 446) 19/03/21 09:51:39 INFO
BlockManagerMasterEndpoint: Trying to remove executor 355 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(355, data-20.bdp.gin.merck.com, 43141, None)
19/03/21 09:51:39 INFO BlockManagerMaster: Removed 355 successfully in
removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint:
Disabling executor 168. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 168
(epoch 446) 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 168 on
data-07.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
BlockManagerMasterEndpoint: Trying to remove executor 168 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(168, data-07.bdp.gin.merck.com, 44833, None)
19/03/21 09:51:39 INFO BlockManagerMaster: Removed 168 successfully in
removeExecutor 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 332
with no recent heartbeats: 173942 ms exceeds timeout 120000 ms 19/03/21
09:54:26 ERROR YarnClusterScheduler: Lost an executor 332 (already removed):
Executor heartbeat timed out after 173942 ms 19/03/21 09:54:26 WARN
HeartbeatReceiver: Removing executor 346 with no recent heartbeats: 172853 ms
exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an
executor 346 (already removed): Executor heartbeat timed out after 172853 ms
19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 355 with no recent
heartbeats: 173169 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
YarnClusterScheduler: Lost an executor 355 (already removed): Executor
heartbeat timed out after 173169 ms 19/03/21 09:54:26 WARN HeartbeatReceiver:
Removing executor 168 with no recent heartbeats: 174129 ms exceeds timeout
120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 168
(already removed): Executor heartbeat timed out after 174129 ms 19/03/21
09:54:26 WARN HeartbeatReceiver: Removing executor 327 with no recent
heartbeats: 169555 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
YarnClusterScheduler: Lost an executor 327 (already removed): Executor
heartbeat timed out after 169555 ms 19/03/21 09:54:26 WARN HeartbeatReceiver:
Removing executor 240 with no recent heartbeats: 177937 ms exceeds timeout
120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 240
(already removed): Executor heartbeat timed out after 177937 ms 19/03/21
09:54:26 WARN HeartbeatReceiver: Removing executor 228 with no recent
heartbeats: 178171 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
YarnClusterScheduler: Lost an executor 228 (already removed): Executor
heartbeat timed out after 178171 ms 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Requesting to kill executor(s) 332 19/03/21
09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 332 does not exist!
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s)
to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting
to kill executor(s) 346 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend:
Executor to kill 346 does not exist! 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill
executor(s) 355 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to
kill 355 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend:
Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Requesting to kill executor(s) 168 19/03/21
09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 168 does not exist!
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s)
to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting
to kill executor(s) 327 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend:
Executor to kill 327 does not exist! 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill
executor(s) 240 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to
kill 240 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend:
Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Requesting to kill executor(s) 228 19/03/21
09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 228 does not exist!
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s)
to be killed is 19/03/21 09:56:43 INFO YarnClusterSchedulerBackend: Shutting
down all executors 19/03/21 09:56:43 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
19/03/21 09:56:43 INFO SchedulerExtensionServices: Stopping
SchedulerExtensionServices (serviceOption=None, services=List(), started=false)
19/03/21 09:56:43 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped! 19/03/21 09:56:43 INFO MemoryStore:
MemoryStore cleared 19/03/21 09:56:43 INFO BlockManager: BlockManager stopped
19/03/21 09:56:43 INFO BlockManagerMaster: BlockManagerMaster stopped 19/03/21
09:56:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped! 19/03/21 09:56:43 INFO SparkContext:
Successfully stopped SparkContext 19/03/21 09:56:43 INFO ApplicationMaster:
Final app status: SUCCEEDED, exitCode: 0 19/03/21 09:56:43 INFO
ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 19/03/21
09:56:43 INFO AMRMClientImpl: Waiting for application to be successfully
unregistered. 19/03/21 09:56:43 INFO ApplicationMaster: Deleting staging
directory
hdfs://bdpprodnn/user/waldmann/.sparkStaging/application_1552944783492_60353
19/03/21 09:56:44 INFO ShutdownHookManager: Shutdown hook called 19/03/21
09:56:44 INFO ShutdownHookManager: Deleting directory
/mnt/data01/hadoop/yarn/local/usercache/waldmann/appcache/application_1552944783492_60353/spark-713791f3-87ca-4234-96d5-9969850fcea9}}{quote}
was:
When using dynamic allocations after all jobs finishes spark delays for several
minutes before finally finishes. Log suggest that executors are not cleared up
properly.
{quote}19/03/21 09:51:38 INFO SparkSession: PROCESSING FINISHED 19/03/21
09:51:38 INFO ExecutorAllocationManager: Request to remove executorIds: 355
19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Requesting to kill
executor(s) 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Actual list
of executor(s) to be killed is 355 19/03/21 09:51:38 INFO
ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 355.
19/03/21 09:51:38 INFO ExecutorAllocationManager: Removing executor 355 because
it has been idle for 60 seconds (new desired total will be 65) 19/03/21
09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 228.
19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 228 (epoch 446) 19/03/21
09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 228 from
BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(228, data-15.bdp.gin.merck.com, 45882, None)
19/03/21 09:51:38 INFO BlockManagerMaster: Removed 228 successfully in
removeExecutor 19/03/21 09:51:38 INFO SparkUI: Stopped Spark web UI at
http://data-04.bdp.gin.merck.com:44304 19/03/21 09:51:38 INFO
YarnClusterScheduler: Executor 228 on data-15.bdp.gin.merck.com killed by
driver. 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint:
Disabling executor 346. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 346
(epoch 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove
executor 346 from BlockManagerMaster. 19/03/21 09:51:38 INFO
BlockManagerMasterEndpoint: Removing block manager BlockManagerId(346,
datanode-02.bdp.gin.merck.com, 41186, None) 19/03/21 09:51:38 INFO
BlockManagerMaster: Removed 346 successfully in removeExecutor 19/03/21
09:51:38 INFO YarnClusterScheduler: Executor 346 on
datanode-02.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 332. 19/03/21
09:51:39 INFO DAGScheduler: Executor lost: 332 (epoch 446) 19/03/21 09:51:39
INFO BlockManagerMasterEndpoint: Trying to remove executor 332 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(332, data-10.bdp.gin.merck.com, 38713, None)
19/03/21 09:51:39 INFO BlockManagerMaster: Removed 332 successfully in
removeExecutor 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 332 on
data-10.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 240. 19/03/21
09:51:39 INFO YarnClusterScheduler: Executor 240 on data-22.bdp.gin.merck.com
killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 240
(epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove
executor 240 from BlockManagerMaster. 19/03/21 09:51:39 INFO
BlockManagerMasterEndpoint: Removing block manager BlockManagerId(240,
data-22.bdp.gin.merck.com, 43344, None) 19/03/21 09:51:39 INFO
BlockManagerMaster: Removed 240 successfully in removeExecutor 19/03/21
09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 327.
19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 327 (epoch 446) 19/03/21
09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 327 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(327, data-20.bdp.gin.merck.com, 34235, None)
19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 327 on
data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
BlockManagerMaster: Removed 327 successfully in removeExecutor 19/03/21
09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 355.
19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 355 on
data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
DAGScheduler: Executor lost: 355 (epoch 446) 19/03/21 09:51:39 INFO
BlockManagerMasterEndpoint: Trying to remove executor 355 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(355, data-20.bdp.gin.merck.com, 43141, None)
19/03/21 09:51:39 INFO BlockManagerMaster: Removed 355 successfully in
removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint:
Disabling executor 168. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 168
(epoch 446) 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 168 on
data-07.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
BlockManagerMasterEndpoint: Trying to remove executor 168 from
BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing
block manager BlockManagerId(168, data-07.bdp.gin.merck.com, 44833, None)
19/03/21 09:51:39 INFO BlockManagerMaster: Removed 168 successfully in
removeExecutor 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 332
with no recent heartbeats: 173942 ms exceeds timeout 120000 ms 19/03/21
09:54:26 ERROR YarnClusterScheduler: Lost an executor 332 (already removed):
Executor heartbeat timed out after 173942 ms 19/03/21 09:54:26 WARN
HeartbeatReceiver: Removing executor 346 with no recent heartbeats: 172853 ms
exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an
executor 346 (already removed): Executor heartbeat timed out after 172853 ms
19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 355 with no recent
heartbeats: 173169 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
YarnClusterScheduler: Lost an executor 355 (already removed): Executor
heartbeat timed out after 173169 ms 19/03/21 09:54:26 WARN HeartbeatReceiver:
Removing executor 168 with no recent heartbeats: 174129 ms exceeds timeout
120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 168
(already removed): Executor heartbeat timed out after 174129 ms 19/03/21
09:54:26 WARN HeartbeatReceiver: Removing executor 327 with no recent
heartbeats: 169555 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
YarnClusterScheduler: Lost an executor 327 (already removed): Executor
heartbeat timed out after 169555 ms 19/03/21 09:54:26 WARN HeartbeatReceiver:
Removing executor 240 with no recent heartbeats: 177937 ms exceeds timeout
120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 240
(already removed): Executor heartbeat timed out after 177937 ms 19/03/21
09:54:26 WARN HeartbeatReceiver: Removing executor 228 with no recent
heartbeats: 178171 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
YarnClusterScheduler: Lost an executor 228 (already removed): Executor
heartbeat timed out after 178171 ms 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Requesting to kill executor(s) 332 19/03/21
09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 332 does not exist!
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s)
to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting
to kill executor(s) 346 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend:
Executor to kill 346 does not exist! 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill
executor(s) 355 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to
kill 355 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend:
Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Requesting to kill executor(s) 168 19/03/21
09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 168 does not exist!
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s)
to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting
to kill executor(s) 327 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend:
Executor to kill 327 does not exist! 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill
executor(s) 240 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to
kill 240 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend:
Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO
YarnClusterSchedulerBackend: Requesting to kill executor(s) 228 19/03/21
09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 228 does not exist!
19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s)
to be killed is 19/03/21 09:56:43 INFO YarnClusterSchedulerBackend: Shutting
down all executors 19/03/21 09:56:43 INFO
YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
19/03/21 09:56:43 INFO SchedulerExtensionServices: Stopping
SchedulerExtensionServices (serviceOption=None, services=List(), started=false)
19/03/21 09:56:43 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped! 19/03/21 09:56:43 INFO MemoryStore:
MemoryStore cleared 19/03/21 09:56:43 INFO BlockManager: BlockManager stopped
19/03/21 09:56:43 INFO BlockManagerMaster: BlockManagerMaster stopped 19/03/21
09:56:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped! 19/03/21 09:56:43 INFO SparkContext:
Successfully stopped SparkContext 19/03/21 09:56:43 INFO ApplicationMaster:
Final app status: SUCCEEDED, exitCode: 0 19/03/21 09:56:43 INFO
ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 19/03/21
09:56:43 INFO AMRMClientImpl: Waiting for application to be successfully
unregistered. 19/03/21 09:56:43 INFO ApplicationMaster: Deleting staging
directory
hdfs://bdpprodnn/user/waldmann/.sparkStaging/application_1552944783492_60353
19/03/21 09:56:44 INFO ShutdownHookManager: Shutdown hook called 19/03/21
09:56:44 INFO ShutdownHookManager: Deleting directory
/mnt/data01/hadoop/yarn/local/usercache/waldmann/appcache/application_1552944783492_60353/spark-713791f3-87ca-4234-96d5-9969850fcea9{quote}
> Spark long delay on close, possible problem with killing executors
> ------------------------------------------------------------------
>
> Key: SPARK-27228
> URL: https://issues.apache.org/jira/browse/SPARK-27228
> Project: Spark
> Issue Type: Bug
> Components: Block Manager
> Affects Versions: 2.3.0
> Reporter: Lukas Waldmann
> Priority: Major
>
> When using dynamic allocations after all jobs finishes spark delays for
> several minutes before finally finishes. Log suggest that executors are not
> cleared up properly.
> {quote}{{19/03/21 09:51:38 INFO SparkSession: PROCESSING FINISHED 19/03/21
> 09:51:38 INFO ExecutorAllocationManager: Request to remove executorIds: 355
> 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Requesting to kill
> executor(s) 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Actual
> list of executor(s) to be killed is 355 19/03/21 09:51:38 INFO
> ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 355.
> 19/03/21 09:51:38 INFO ExecutorAllocationManager: Removing executor 355
> because it has been idle for 60 seconds (new desired total will be 65)
> 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling
> executor 228. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 228 (epoch
> 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove
> executor 228 from BlockManagerMaster. 19/03/21 09:51:38 INFO
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(228,
> data-15.bdp.gin.merck.com, 45882, None) 19/03/21 09:51:38 INFO
> BlockManagerMaster: Removed 228 successfully in removeExecutor 19/03/21
> 09:51:38 INFO SparkUI: Stopped Spark web UI at
> [http://data-04.bdp.gin.merck.com:44304|http://data-04.bdp.gin.merck.com:44304/]
> 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 228 on
> data-15.bdp.gin.merck.com killed by driver. 19/03/21 09:51:38 INFO
> YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 346. 19/03/21
> 09:51:38 INFO DAGScheduler: Executor lost: 346 (epoch 446) 19/03/21 09:51:38
> INFO BlockManagerMasterEndpoint: Trying to remove executor 346 from
> BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint:
> Removing block manager BlockManagerId(346, datanode-02.bdp.gin.merck.com,
> 41186, None) 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 346
> successfully in removeExecutor 19/03/21 09:51:38 INFO YarnClusterScheduler:
> Executor 346 on datanode-02.bdp.gin.merck.com killed by driver. 19/03/21
> 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor
> 332. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 332 (epoch 446)
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor
> 332 from BlockManagerMaster. 19/03/21 09:51:39 INFO
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(332,
> data-10.bdp.gin.merck.com, 38713, None) 19/03/21 09:51:39 INFO
> BlockManagerMaster: Removed 332 successfully in removeExecutor 19/03/21
> 09:51:39 INFO YarnClusterScheduler: Executor 332 on data-10.bdp.gin.merck.com
> killed by driver. 19/03/21 09:51:39 INFO
> YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 240. 19/03/21
> 09:51:39 INFO YarnClusterScheduler: Executor 240 on data-22.bdp.gin.merck.com
> killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 240
> (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to
> remove executor 240 from BlockManagerMaster. 19/03/21 09:51:39 INFO
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(240,
> data-22.bdp.gin.merck.com, 43344, None) 19/03/21 09:51:39 INFO
> BlockManagerMaster: Removed 240 successfully in removeExecutor 19/03/21
> 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor
> 327. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 327 (epoch 446)
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor
> 327 from BlockManagerMaster. 19/03/21 09:51:39 INFO
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(327,
> data-20.bdp.gin.merck.com, 34235, None) 19/03/21 09:51:39 INFO
> YarnClusterScheduler: Executor 327 on data-20.bdp.gin.merck.com killed by
> driver. 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 327 successfully
> in removeExecutor 19/03/21 09:51:39 INFO
> YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 355. 19/03/21
> 09:51:39 INFO YarnClusterScheduler: Executor 355 on data-20.bdp.gin.merck.com
> killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 355
> (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to
> remove executor 355 from BlockManagerMaster. 19/03/21 09:51:39 INFO
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(355,
> data-20.bdp.gin.merck.com, 43141, None) 19/03/21 09:51:39 INFO
> BlockManagerMaster: Removed 355 successfully in removeExecutor 19/03/21
> 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor
> 168. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 168 (epoch 446)
> 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 168 on
> data-07.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO
> BlockManagerMasterEndpoint: Trying to remove executor 168 from
> BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint:
> Removing block manager BlockManagerId(168, data-07.bdp.gin.merck.com, 44833,
> None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 168 successfully in
> removeExecutor 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor
> 332 with no recent heartbeats: 173942 ms exceeds timeout 120000 ms 19/03/21
> 09:54:26 ERROR YarnClusterScheduler: Lost an executor 332 (already removed):
> Executor heartbeat timed out after 173942 ms 19/03/21 09:54:26 WARN
> HeartbeatReceiver: Removing executor 346 with no recent heartbeats: 172853 ms
> exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost
> an executor 346 (already removed): Executor heartbeat timed out after 172853
> ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 355 with no
> recent heartbeats: 173169 ms exceeds timeout 120000 ms 19/03/21 09:54:26
> ERROR YarnClusterScheduler: Lost an executor 355 (already removed): Executor
> heartbeat timed out after 173169 ms 19/03/21 09:54:26 WARN HeartbeatReceiver:
> Removing executor 168 with no recent heartbeats: 174129 ms exceeds timeout
> 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 168
> (already removed): Executor heartbeat timed out after 174129 ms 19/03/21
> 09:54:26 WARN HeartbeatReceiver: Removing executor 327 with no recent
> heartbeats: 169555 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
> YarnClusterScheduler: Lost an executor 327 (already removed): Executor
> heartbeat timed out after 169555 ms 19/03/21 09:54:26 WARN HeartbeatReceiver:
> Removing executor 240 with no recent heartbeats: 177937 ms exceeds timeout
> 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 240
> (already removed): Executor heartbeat timed out after 177937 ms 19/03/21
> 09:54:26 WARN HeartbeatReceiver: Removing executor 228 with no recent
> heartbeats: 178171 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR
> YarnClusterScheduler: Lost an executor 228 (already removed): Executor
> heartbeat timed out after 178171 ms 19/03/21 09:54:26 INFO
> YarnClusterSchedulerBackend: Requesting to kill executor(s) 332 19/03/21
> 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 332 does not
> exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of
> executor(s) to be killed is 19/03/21 09:54:26 INFO
> YarnClusterSchedulerBackend: Requesting to kill executor(s) 346 19/03/21
> 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 346 does not
> exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of
> executor(s) to be killed is 19/03/21 09:54:26 INFO
> YarnClusterSchedulerBackend: Requesting to kill executor(s) 355 19/03/21
> 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 355 does not
> exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of
> executor(s) to be killed is 19/03/21 09:54:26 INFO
> YarnClusterSchedulerBackend: Requesting to kill executor(s) 168 19/03/21
> 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 168 does not
> exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of
> executor(s) to be killed is 19/03/21 09:54:26 INFO
> YarnClusterSchedulerBackend: Requesting to kill executor(s) 327 19/03/21
> 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 327 does not
> exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of
> executor(s) to be killed is 19/03/21 09:54:26 INFO
> YarnClusterSchedulerBackend: Requesting to kill executor(s) 240 19/03/21
> 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 240 does not
> exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of
> executor(s) to be killed is 19/03/21 09:54:26 INFO
> YarnClusterSchedulerBackend: Requesting to kill executor(s) 228 19/03/21
> 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 228 does not
> exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of
> executor(s) to be killed is 19/03/21 09:56:43 INFO
> YarnClusterSchedulerBackend: Shutting down all executors 19/03/21 09:56:43
> INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut
> down 19/03/21 09:56:43 INFO SchedulerExtensionServices: Stopping
> SchedulerExtensionServices (serviceOption=None, services=List(),
> started=false) 19/03/21 09:56:43 INFO MapOutputTrackerMasterEndpoint:
> MapOutputTrackerMasterEndpoint stopped! 19/03/21 09:56:43 INFO MemoryStore:
> MemoryStore cleared 19/03/21 09:56:43 INFO BlockManager: BlockManager stopped
> 19/03/21 09:56:43 INFO BlockManagerMaster: BlockManagerMaster stopped
> 19/03/21 09:56:43 INFO
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
> OutputCommitCoordinator stopped! 19/03/21 09:56:43 INFO SparkContext:
> Successfully stopped SparkContext 19/03/21 09:56:43 INFO ApplicationMaster:
> Final app status: SUCCEEDED, exitCode: 0 19/03/21 09:56:43 INFO
> ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 19/03/21
> 09:56:43 INFO AMRMClientImpl: Waiting for application to be successfully
> unregistered. 19/03/21 09:56:43 INFO ApplicationMaster: Deleting staging
> directory
> hdfs://bdpprodnn/user/waldmann/.sparkStaging/application_1552944783492_60353
> 19/03/21 09:56:44 INFO ShutdownHookManager: Shutdown hook called 19/03/21
> 09:56:44 INFO ShutdownHookManager: Deleting directory
> /mnt/data01/hadoop/yarn/local/usercache/waldmann/appcache/application_1552944783492_60353/spark-713791f3-87ca-4234-96d5-9969850fcea9}}{quote}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]