[ 
https://issues.apache.org/jira/browse/SPARK-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269517#comment-16269517
 ] 

Thomas Graves commented on SPARK-22620:
---------------------------------------

Looking at this some more: normally when the executor heartbeat times out you 
would also get the onDisconnected message (either from YARN telling it to kill 
the container, or from the process eventually dying), which would call 
CoarseGrainedSchedulerBackend.removeExecutor and clear up this issue. In my 
case that didn't happen: the host was in a bad state and YARN couldn't tell 
the NodeManager to kill the container. It may have been in a weird state where 
the connection was still up but the NodeManager was down. 
It seems like it would be a good idea for us to call removeExecutor from the 
heartbeat receiver so we aren't relying on the onDisconnected message arriving 
and things being out of sync. It won't hurt anything if removeExecutor is 
called twice, since it first checks whether the executor is in executorDataMap.
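
The double-call safety can be sketched with a simplified model (the names 
mirror CoarseGrainedSchedulerBackend's fields, but this is illustrative, not 
the actual Spark code):

```scala
import scala.collection.mutable

// Simplified stand-in for CoarseGrainedSchedulerBackend's executor
// bookkeeping. Real Spark stores richer ExecutorData; a plain map is enough
// to show why calling removeExecutor twice is harmless.
class SchedulerBackendModel {
  val executorDataMap = mutable.HashMap[String, String]() // execId -> host
  var removals = 0

  def removeExecutor(execId: String): Unit = {
    // Guard mirrors "first checks whether the executor is in executorDataMap":
    // a second call for the same executor is a no-op.
    executorDataMap.remove(execId) match {
      case Some(_) => removals += 1 // real code would also update stats here
      case None    => ()            // already removed: nothing to do
    }
  }
}

val backend = new SchedulerBackendModel
backend.executorDataMap("137") = "foo1.com"
backend.removeExecutor("137") // e.g. from the heartbeat receiver
backend.removeExecutor("137") // e.g. from a later onDisconnected
```

Only the first call does any work, so wiring the heartbeat-timeout path to 
removeExecutor would not conflict with the existing onDisconnected path.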

> Deadlock in blacklisting when executor is removed due to a heartbeat timeout
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-22620
>                 URL: https://issues.apache.org/jira/browse/SPARK-22620
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.2.0
>            Reporter: Thomas Graves
>            Priority: Critical
>
> I ran into a deadlock dealing with blacklisting where the application was 
> stuck in a state where it thought it had executors to schedule on but it 
> really didn't, because those executors were dead.  All executors were 
> blacklisted and it couldn't run any tasks.
> Note that for this to happen the application has to be in a state where many 
> tasks are going to fail (at least enough to cause all executors to be 
> blacklisted) and an executor has to hit a heartbeat timeout.
> Small disclaimer: I ran into this on Spark 2.1 with the blacklisting feature 
> backported.  Looking at the code on the master branch, I don't see this 
> fixed there either.
> Normally TaskSetManager.abortIfCompletelyBlacklisted prevents the app from 
> becoming deadlocked when no executors are available.  Unfortunately this did 
> not take effect because in this case a dead executor was still in 
> TaskSchedulerImpl.hostToExecutors, and its host wasn't blacklisted (the 
> executor had a heartbeat timeout), so the scheduler thought it still had 
> executors to run on.
> Now, how it got into this situation:
> The HeartbeatReceiver timed out the executor based on no recent heartbeat.  
> This called TaskSchedulerImpl.executorLost, which called 
> TaskSchedulerImpl.removeExecutor, removing the executor from 
> executorIdToRunningTaskIds and hostToExecutors.   It does other handling 
> here and sends a CompletionEvent via DAGScheduler.taskEnded.  
> TaskSchedulerImpl.removeExecutor returns, and TaskSchedulerImpl.executorLost 
> calls into DAGScheduler.executorLost and then 
> CoarseGrainedSchedulerBackend.reviveOffers, both of which send messages 
> asynchronously.   This returns, and HeartbeatReceiver then calls 
> sc.killAndReplaceExecutor asynchronously.
> Now there are a bunch of messages sent that are being processed by other 
> threads.  You can see the order in which things ran from the logs below.  In 
> this situation, CoarseGrainedSchedulerBackend.makeOffers is called before 
> sc.killAndReplaceExecutor can add the executor to executorsPendingToRemove.  
> CoarseGrainedSchedulerBackend.makeOffers uses 
> CoarseGrainedSchedulerBackend.executorDataMap and executorsPendingToRemove 
> to check whether the executor exists and is alive.  At this point it is 
> still in executorDataMap and not pending removal.   Since it looks alive, 
> makeOffers calls into TaskSchedulerImpl.resourceOffers, which adds the 
> executor back into executorIdToRunningTaskIds and hostToExecutors.  Nothing 
> after this removes it from hostToExecutors, so the 
> TaskSetManager.abortIfCompletelyBlacklisted routine always sees an executor 
> that is actually dead.
> While looking at this I also found a bug where the executor is not removed 
> from CoarseGrainedSchedulerBackend.executorDataMap either.  Normally the 
> call to CoarseGrainedSchedulerBackend.removeExecutor would remove it, but 
> since this was a heartbeat timeout that is never called.  The only way that 
> would have helped in this situation is if the executor had been removed from 
> executorDataMap before the call to CoarseGrainedSchedulerBackend.makeOffers 
> -> TaskSchedulerImpl.resourceOffers, in which case it would never have been 
> added back to executorIdToRunningTaskIds and hostToExecutors.  Note other 
> stats like totalCoreCount and totalRegisteredExecutors will also be off 
> since removeExecutor is not called.
> Below are the logs that show the race.  Notice how the "Requesting to kill 
> executor(s) 137" came in the middle of the removal (this is 
> CoarseGrainedSchedulerBackend.killExecutors), but it doesn't actually get as 
> far as adding the executor to executorsPendingToRemove ("Actual list of 
> executor(s) to be killed is 137") until after the executor was added back to 
> hostToExecutors ("Host added was in lost list earlier: foo1.com"):
> 2017-11-14 15:56:44,376 [task-result-getter-0] INFO  
> org.apache.spark.scheduler.TaskSetManager  - Finished task 3203.0 in stage 
> 7.0 (TID 11011) in 1396594 ms on foo3.com (executor 8) (2971/4004)
> 2017-11-14 15:56:50,729 [dispatcher-event-loop-23] WARN  
> org.apache.spark.HeartbeatReceiver  - Removing executor 137 with no recent 
> heartbeats: 639438 ms exceeds timeout 600000 ms
> 2017-11-14 15:56:50,731 [dispatcher-event-loop-23] ERROR 
> org.apache.spark.scheduler.cluster.YarnClusterScheduler  - Lost executor 137 
> on foo1.com: Executor heartbeat timed out after 639438 ms
> 2017-11-14 15:56:50,773 [dispatcher-event-loop-23] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Lost task 2696.0 in stage 7.0 
> (TID 11061, foo1.com, executor 137): ExecutorLostFailure (executor 137 exited 
> caused by one of the running tasks) Reason: Executor heartbeat timed out 
> after 639438 ms
> 2017-11-14 15:56:50,778 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - Executor lost: 137 (epoch 0)
> 2017-11-14 15:56:50,779 [dispatcher-event-loop-20] INFO  
> org.apache.spark.storage.BlockManagerMasterEndpoint  - Trying to remove 
> executor 137 from BlockManagerMaster.
> 2017-11-14 15:56:50,779 [kill-executor-thread] INFO  
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend  - Requesting 
> to kill executor(s) 137
> 2017-11-14 15:56:50,780 [dispatcher-event-loop-20] INFO  
> org.apache.spark.storage.BlockManagerMasterEndpoint  - Removing block manager 
> BlockManagerId(137, foo1.com, 42569, None)
> 2017-11-14 15:56:50,780 [dag-scheduler-event-loop] INFO  
> org.apache.spark.storage.BlockManagerMaster  - Removed 137 successfully in 
> removeExecutor
> 2017-11-14 15:56:50,781 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - Host added was in lost list 
> earlier: foo1.com
> 2017-11-14 15:56:50,782 [kill-executor-thread] INFO  
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend  - Actual list 
> of executor(s) to be killed is 137
> 2017-11-14 15:56:50,827 [dispatcher-event-loop-6] INFO  
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint  - Driver requested 
> to kill executor(s) 137.
> 2017-11-14 15:56:51,616 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator  - Will request 1 executor 
> container(s), each with 10 core(s) and 15360 MB memory (including 8192 MB of 
> overhead)
> Note there may be another bug here with speculation as well because some time 
> later I see this same host in a log message that makes it sound like it 
> thinks the task is still running on that dead host:
> 2017-11-14 16:46:40,970 [task-scheduler-speculation] INFO  
> org.apache.spark.scheduler.TaskSetManager  - Marking task 2696 in stage 7.0 
> (on gsbl560n05.blue.ygrid.yahoo.com) as speculatable because it ran more than 
> 2254168 ms
> We can split this off into another JIRA if we find it's actually a bug.  It 
> looks like the issue there is that TaskSetManager.copiesRunning is not 
> decremented because we are using the external shuffle service, but this 
> needs further investigation.
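
The interleaving described above can be modeled as a rough sketch (simplified 
field and method names patterned after the real classes; this is illustrative, 
not the actual Spark implementation):

```scala
import scala.collection.mutable

// Illustrative model of the race: the heartbeat timeout removes the executor
// from the scheduler's maps but not from the backend's executorDataMap, and
// makeOffers runs before killAndReplaceExecutor can mark the executor as
// pending removal, so resourceOffers resurrects it in hostToExecutors.
object RaceModel {
  val executorDataMap = mutable.Set[String]()          // backend's view
  val executorsPendingToRemove = mutable.Set[String]() // kill requests in flight
  val hostToExecutors = mutable.Map[String, mutable.Set[String]]()

  def heartbeatTimeout(execId: String, host: String): Unit = {
    // TaskSchedulerImpl.removeExecutor: drops the executor from the
    // scheduler's maps, but NOT from executorDataMap (the second bug noted).
    hostToExecutors.get(host).foreach(_ -= execId)
  }

  def makeOffers(execId: String, host: String): Unit = {
    // The backend considers an executor alive if it is registered and not
    // pending removal; resourceOffers then re-adds it to hostToExecutors.
    val alive = executorDataMap.contains(execId) &&
      !executorsPendingToRemove.contains(execId)
    if (alive)
      hostToExecutors.getOrElseUpdate(host, mutable.Set()) += execId
  }

  def killAndReplaceExecutor(execId: String): Unit =
    executorsPendingToRemove += execId
}

// Interleaving from the logs above: timeout fires, then makeOffers wins the
// race against killAndReplaceExecutor.
RaceModel.executorDataMap += "137"
RaceModel.hostToExecutors("foo1.com") = mutable.Set("137")
RaceModel.heartbeatTimeout("137", "foo1.com")
RaceModel.makeOffers("137", "foo1.com")  // runs first: resurrects executor 137
RaceModel.killAndReplaceExecutor("137")  // too late; 137 is back in the maps
```

In this model the dead executor ends up back in hostToExecutors even though 
the kill request eventually lands, which is exactly the state that defeats 
abortIfCompletelyBlacklisted.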



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
