wankunde edited a comment on pull request #34536:
URL: https://github.com/apache/spark/pull/34536#issuecomment-971211171


   > For such registered `BlockManager`s, fortunately, we have 
`HeartbeatReceiver.expireDeadHosts` to remove them in the end, which fires a 
`SparkListenerBlockManagerRemoved` during removal. Note that, there won't be a 
`SparkListenerExecutorRemoved` fired since scheduler backend 
(`executorDataMap`) already doesn't contain the executor.
   
   @Ngone51 @sumeetgajjar 
   
   `HeartbeatReceiver.expireDeadHosts` will not clean up those `BlockManager`s 
if the executor was killed with the reason "Executor heartbeat timed out". An 
executor's heartbeat can time out because of a network issue, or for some other 
reason such as SPARK-20977.
   
   Am I right?
   
   Driver Logs
   ```
   21/11/13 05:06:20,999 WARN [dispatcher-event-loop-36] 
spark.HeartbeatReceiver:69 : Removing executor 3056 with no recent heartbeats: 
350149 ms exceeds timeout 300000 ms
   21/11/13 05:06:20,999 INFO [kill-executor-thread] 
cluster.YarnClientSchedulerBackend:57 : Requesting to kill executor(s) 3056
   21/11/13 05:06:21,000 INFO [kill-executor-thread] 
cluster.YarnClientSchedulerBackend:57 : Actual list of executor(s) to be killed 
is 3056
   21/11/13 05:06:21,000 INFO [dispatcher-event-loop-8] 
yarn.ApplicationMaster$AMEndpoint:57 : Driver requested to kill executor(s) 
3056.
   21/11/13 05:06:21,000 INFO [dispatcher-CoarseGrainedScheduler] 
cluster.YarnSchedulerBackend$YarnDriverEndpoint:57 : Asked to remove executor 
3056 with reason Executor heartbeat timed out after 350149 ms
   21/11/13 05:06:21,000 ERROR [dispatcher-CoarseGrainedScheduler] 
cluster.YarnScheduler:73 : Lost executor 3056 on executor_host: Executor 
heartbeat timed out after 350149 ms
   21/11/13 05:06:21,041 WARN [dispatcher-CoarseGrainedScheduler] 
scheduler.TaskSetManager:69 : Lost task 262191.0 in stage 452627.0 (TID 
245764597, executor_host, executor 3056): ExecutorLostFailure (executor 3056 
exited caused by one of the running tasks) Reason: Executor heartbeat timed out 
after 350149 ms
   21/11/13 05:06:21,067 WARN [dispatcher-CoarseGrainedScheduler] 
scheduler.TaskSetManager:69 : Lost task 259149.0 in stage 452627.0 (TID 
245761130, executor_host, executor 3056): ExecutorLostFailure (executor 3056 
exited caused by one of the running tasks) Reason: Executor heartbeat timed out 
after 350149 ms
   21/11/13 05:06:22,068 INFO [dispatcher-BlockManagerMaster] 
storage.BlockManagerMasterEndpoint:57 : Trying to remove executor 3056 from 
BlockManagerMaster.
   21/11/13 05:06:22,072 INFO [dispatcher-BlockManagerMaster] 
storage.BlockManagerMasterEndpoint:57 : Removing block manager 
BlockManagerId(3056, executor_host, 30504, None)
   21/11/13 05:06:22,073 INFO [dag-scheduler-event-loop] 
storage.BlockManagerMaster:57 : Removed 3056 successfully in removeExecutor
   21/11/13 05:06:22,962 INFO [dispatcher-BlockManagerMaster] 
storage.BlockManagerMasterEndpoint:57 : Registering block manager 
executor_host:30504 with 88.5 GiB RAM, BlockManagerId(3056, executor_host, 
30504, None)
   ```
   
   Executor Logs
   ```
   21/11/13 05:06:21,004 INFO [dispatcher-Executor] 
executor.YarnCoarseGrainedExecutorBackend:57 : Driver commanded a shutdown
   21/11/13 05:06:22,215 INFO [block-manager-future-0] 
storage.BlockManagerMaster:57 : Registering BlockManager BlockManagerId(3056, 
executor_host, 30504, None)
   ```
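
To make the sequence in the logs above easier to follow, here is a toy model of it. It is only a sketch, not Spark code: `ToyDriver` and all of its methods are hypothetical names, and it assumes the heartbeat receiver stops tracking an executor once it has expired it. Under that assumption, a `BlockManager` that registers again after the removal is never seen by a later expiry pass.

```python
# Toy model of the timeline in the logs. All names are hypothetical and
# do not correspond to Spark's real classes or method signatures.
class ToyDriver:
    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_seen = {}          # executor id -> last heartbeat time (ms)
        self.block_managers = set()  # executor ids with a registered block manager

    def register_block_manager(self, executor_id):
        self.block_managers.add(executor_id)

    def heartbeat(self, executor_id, now_ms):
        self.last_seen[executor_id] = now_ms

    def expire_dead_hosts(self, now_ms):
        """Remove executors whose last heartbeat is older than the timeout."""
        expired = [e for e, t in self.last_seen.items()
                   if now_ms - t > self.timeout_ms]
        for e in expired:
            # Assumption: once expired, the executor is no longer tracked.
            del self.last_seen[e]
            self.block_managers.discard(e)
        return expired


def replay_timeline():
    driver = ToyDriver(timeout_ms=300_000)
    driver.register_block_manager("3056")
    driver.heartbeat("3056", now_ms=0)

    # 350149 ms without a heartbeat: the executor and its block manager go away.
    first_pass = driver.expire_dead_hosts(now_ms=350_149)

    # The dying executor's registration arrives after the removal.
    driver.register_block_manager("3056")

    # A later pass finds nothing to expire, so the block manager stays.
    second_pass = driver.expire_dead_hosts(now_ms=700_000)
    return first_pass, second_pass, sorted(driver.block_managers)
```

In this model the second pass cannot help, because expiry is driven only by `last_seen` and the late registration never adds an entry there.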


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


