[ 
https://issues.apache.org/jira/browse/SPARK-27228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798209#comment-16798209
 ] 

Lukas Waldmann commented on SPARK-27228:
----------------------------------------

Startup parameters:

spark-submit \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.driver.maxResultSize=4g \
  --executor-memory 4g --driver-memory 8g \
  --master yarn --deploy-mode cluster
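
To make the settings easier to compare against the log in the issue description, here is a small Python sketch that pulls the --conf pairs out of the command above. The helper name parse_conf and the IMPLIED_BY_LOG table are illustrative only. The two timeout entries are an assumption: they are not set on the command line, but the log's "idle for 60 seconds" and "timeout 120000 ms" match Spark's documented defaults for these keys.

```python
import shlex

# The startup command from this comment, joined into one string.
COMMAND = (
    "spark-submit --conf spark.shuffle.service.enabled=true "
    "--conf spark.dynamicAllocation.enabled=true "
    "--conf spark.driver.maxResultSize=4g "
    "--executor-memory 4g --driver-memory 8g --master yarn --deploy-mode cluster"
)


def parse_conf(command):
    """Return the key/value pairs passed via --conf (hypothetical helper)."""
    tokens = shlex.split(command)
    conf = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "--conf":
            key, _, val = value.partition("=")
            conf[key] = val
    return conf


CONF = parse_conf(COMMAND)

# Assumption: not set explicitly; values inferred from the log messages,
# which match the documented Spark defaults for these keys.
IMPLIED_BY_LOG = {
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.network.timeout": "120s",
}

for key, val in {**CONF, **IMPLIED_BY_LOG}.items():
    print(f"{key} = {val}")
```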

> Spark long delay on close, possible problem with killing executors
> ------------------------------------------------------------------
>
>                 Key: SPARK-27228
>                 URL: https://issues.apache.org/jira/browse/SPARK-27228
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 2.3.0
>            Reporter: Lukas Waldmann
>            Priority: Major
>         Attachments: log.html
>
>
> When using dynamic allocation, Spark waits several minutes after all jobs 
> have finished before it finally exits. The log suggests that executors are 
> not cleaned up properly.
> {quote}{{19/03/21 09:51:38 INFO SparkSession: PROCESSING FINISHED
> 19/03/21 09:51:38 INFO ExecutorAllocationManager: Request to remove executorIds: 355
> 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 355
> 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 355
> 19/03/21 09:51:38 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 355.
> 19/03/21 09:51:38 INFO ExecutorAllocationManager: Removing executor 355 because it has been idle for 60 seconds (new desired total will be 65)
> 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 228.
> 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 228 (epoch 446)
> 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 228 from BlockManagerMaster.
> 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(228, data-15.bdp.gin.merck.com, 45882, None)
> 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 228 successfully in removeExecutor
> 19/03/21 09:51:38 INFO SparkUI: Stopped Spark web UI at http://data-04.bdp.gin.merck.com:44304
> 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 228 on data-15.bdp.gin.merck.com killed by driver.
> 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 346.
> 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 346 (epoch 446)
> 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 346 from BlockManagerMaster.
> 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(346, datanode-02.bdp.gin.merck.com, 41186, None)
> 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 346 successfully in removeExecutor
> 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 346 on datanode-02.bdp.gin.merck.com killed by driver.
> 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 332.
> 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 332 (epoch 446)
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 332 from BlockManagerMaster.
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(332, data-10.bdp.gin.merck.com, 38713, None)
> 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 332 successfully in removeExecutor
> 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 332 on data-10.bdp.gin.merck.com killed by driver.
> 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 240.
> 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 240 on data-22.bdp.gin.merck.com killed by driver.
> 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 240 (epoch 446)
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 240 from BlockManagerMaster.
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(240, data-22.bdp.gin.merck.com, 43344, None)
> 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 240 successfully in removeExecutor
> 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 327.
> 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 327 (epoch 446)
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 327 from BlockManagerMaster.
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(327, data-20.bdp.gin.merck.com, 34235, None)
> 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 327 on data-20.bdp.gin.merck.com killed by driver.
> 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 327 successfully in removeExecutor
> 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 355.
> 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 355 on data-20.bdp.gin.merck.com killed by driver.
> 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 355 (epoch 446)
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 355 from BlockManagerMaster.
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(355, data-20.bdp.gin.merck.com, 43141, None)
> 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 355 successfully in removeExecutor
> 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 168.
> 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 168 (epoch 446)
> 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 168 on data-07.bdp.gin.merck.com killed by driver.
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 168 from BlockManagerMaster.
> 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(168, data-07.bdp.gin.merck.com, 44833, None)
> 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 168 successfully in removeExecutor
> 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 332 with no recent heartbeats: 173942 ms exceeds timeout 120000 ms
> 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 332 (already removed): Executor heartbeat timed out after 173942 ms
> 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 346 with no recent heartbeats: 172853 ms exceeds timeout 120000 ms
> 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 346 (already removed): Executor heartbeat timed out after 172853 ms
> 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 355 with no recent heartbeats: 173169 ms exceeds timeout 120000 ms
> 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 355 (already removed): Executor heartbeat timed out after 173169 ms
> 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 168 with no recent heartbeats: 174129 ms exceeds timeout 120000 ms
> 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 168 (already removed): Executor heartbeat timed out after 174129 ms
> 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 327 with no recent heartbeats: 169555 ms exceeds timeout 120000 ms
> 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 327 (already removed): Executor heartbeat timed out after 169555 ms
> 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 240 with no recent heartbeats: 177937 ms exceeds timeout 120000 ms
> 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 240 (already removed): Executor heartbeat timed out after 177937 ms
> 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 228 with no recent heartbeats: 178171 ms exceeds timeout 120000 ms
> 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 228 (already removed): Executor heartbeat timed out after 178171 ms
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 332
> 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 332 does not exist!
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 346
> 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 346 does not exist!
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 355
> 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 355 does not exist!
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 168
> 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 168 does not exist!
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 327
> 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 327 does not exist!
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 240
> 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 240 does not exist!
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 228
> 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 228 does not exist!
> 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is
> 19/03/21 09:56:43 INFO YarnClusterSchedulerBackend: Shutting down all executors
> 19/03/21 09:56:43 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
> 19/03/21 09:56:43 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices (serviceOption=None, services=List(), started=false)
> 19/03/21 09:56:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 19/03/21 09:56:43 INFO MemoryStore: MemoryStore cleared
> 19/03/21 09:56:43 INFO BlockManager: BlockManager stopped
> 19/03/21 09:56:43 INFO BlockManagerMaster: BlockManagerMaster stopped
> 19/03/21 09:56:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
> 19/03/21 09:56:43 INFO SparkContext: Successfully stopped SparkContext
> 19/03/21 09:56:43 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
> 19/03/21 }}
> {quote}
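
To put numbers on the delay, the following Python sketch computes the gaps between the main phases visible in the quoted log. The timestamps are copied from the log. The event labels and the choice of which lines mark a phase boundary are my own reading, not something Spark reports.

```python
from datetime import datetime

# (label, timestamp) pairs taken from the quoted log; labels are descriptive only.
EVENTS = [
    ("PROCESSING FINISHED", "19/03/21 09:51:38"),
    ("last executor removed from BlockManagerMaster", "19/03/21 09:51:39"),
    ("HeartbeatReceiver reports timed-out executors", "19/03/21 09:54:26"),
    ("Shutting down all executors", "19/03/21 09:56:43"),
]


def parse(ts):
    """Parse the yy/MM/dd HH:mm:ss timestamp format used in the log."""
    return datetime.strptime(ts, "%y/%m/%d %H:%M:%S")


def gaps(events):
    """Return (from_label, to_label, seconds) for each consecutive pair."""
    parsed = [(label, parse(ts)) for label, ts in events]
    return [
        (a_label, b_label, int((b_time - a_time).total_seconds()))
        for (a_label, a_time), (b_label, b_time) in zip(parsed, parsed[1:])
    ]


GAPS = gaps(EVENTS)
TOTAL = sum(seconds for _, _, seconds in GAPS)

for start, end, seconds in GAPS:
    print(f"{seconds:4d} s  {start} -> {end}")
print(f"{TOTAL:4d} s  total from PROCESSING FINISHED to shutdown")
```

By this reading, 305 seconds pass between PROCESSING FINISHED and the start of shutdown: 167 seconds until the heartbeat timeouts are reported for executors that were already removed, and a further 137 seconds before the executors are shut down.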


