[ https://issues.apache.org/jira/browse/SPARK-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268792#comment-16268792 ]

Thomas Graves commented on SPARK-22620:
---------------------------------------

Note this was running Spark on YARN in cluster mode with dynamic allocation 
disabled. In this case all of the tasks started failing after a while because 
an external service they were trying to connect to was down.

> Deadlock in blacklisting when executor is removed due to a heartbeat timeout
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-22620
>                 URL: https://issues.apache.org/jira/browse/SPARK-22620
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.2.0
>            Reporter: Thomas Graves
>            Priority: Critical
>
> I ran into a deadlock involving blacklisting where the application was stuck 
> in a state where it thought it had executors to schedule on, but it really 
> didn't because those executors were dead.  So all of the executors it could 
> actually use were blacklisted and it couldn't run any tasks.
>
> Note that for this to happen the application has to be in a state where many 
> tasks are failing (at least enough to cause all executors to be blacklisted) 
> and an executor has to hit a heartbeat timeout.
>
> Small disclaimer: I ran into this on Spark 2.1 with the blacklisting feature 
> backported, but looking at the code on the master branch I don't see it fixed 
> there either.
> Normally TaskSetManager.abortIfCompletelyBlacklisted prevents the app from 
> becoming deadlocked when no executors are available.  Unfortunately it did 
> not take effect here because a dead executor was still in 
> TaskSchedulerImpl.hostToExecutors, and its host wasn't blacklisted (it had 
> only had a heartbeat timeout), so the scheduler thought it still had an 
> executor to run on.
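> A minimal, self-contained sketch (not actual Spark code; the field names just 
> mirror the ones mentioned above, and the values are made up) of why the abort 
> check is fooled by the dead executor:
>
>     object AbortCheckSketch {
>       def main(args: Array[String]): Unit = {
>         // Executor 137 is dead, but the scheduler still lists it for its host.
>         val hostToExecutors = Map("foo1.com" -> Set("137"))
>         // Blacklist entries come from task failures; a heartbeat timeout adds none.
>         val nodeBlacklist = Set.empty[String]
>         val execBlacklist = Set("1", "2", "3")  // the live executors, all blacklisted
>
>         // abortIfCompletelyBlacklisted-style check: is there any (host, executor)
>         // pair left that is neither node- nor executor-blacklisted?
>         val schedulableExists = hostToExecutors.exists { case (host, execs) =>
>           !nodeBlacklist.contains(host) && execs.exists(e => !execBlacklist.contains(e))
>         }
>
>         // Prints true: the dead executor makes the check believe progress is
>         // still possible, so the task set is never aborted and the app deadlocks.
>         println(s"schedulable executor appears to exist: $schedulableExists")
>       }
>     }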
> Now, how it got into this situation:
> The HeartbeatReceiver expired the executor because it saw no recent 
> heartbeat.  This called TaskSchedulerImpl.executorLost, which called 
> TaskSchedulerImpl.removeExecutor, which removed the executor from 
> executorIdToRunningTaskIds and hostToExecutors.  It does other handling here 
> and sends a CompletionEvent via DAG.taskEnded.  
> TaskSchedulerImpl.removeExecutor returns, and TaskSchedulerImpl.executorLost 
> then calls into DAG.executorLost and 
> CoarseGrainedSchedulerBackend.reviveOffers, both of which send messages 
> asynchronously.  This returns, and the HeartbeatReceiver then calls 
> sc.killAndReplaceExecutor asynchronously.
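> A rough sketch of that synchronous removal plus the asynchronous follow-ups 
> (simplified state and assumed method shapes; the queue only stands in for the 
> RPC/event loops, it is not how Spark dispatches these messages):
>
>     import scala.collection.mutable
>
>     object HeartbeatTimeoutSketch {
>       val executorIdToRunningTaskIds = mutable.Map("137" -> mutable.Set(11061L))
>       val hostToExecutors            = mutable.Map("foo1.com" -> mutable.Set("137"))
>       val asyncQueue                 = mutable.Queue.empty[String]
>
>       def removeExecutor(execId: String, host: String): Unit = {
>         executorIdToRunningTaskIds.remove(execId)
>         hostToExecutors.get(host).foreach { execs =>
>           execs -= execId
>           if (execs.isEmpty) hostToExecutors.remove(host)
>         }
>         asyncQueue.enqueue("DAG.taskEnded (CompletionEvent)")
>       }
>
>       def executorLost(execId: String, host: String): Unit = {
>         removeExecutor(execId, host)                         // synchronous cleanup
>         asyncQueue.enqueue("DAG.executorLost")                // async
>         asyncQueue.enqueue("CoarseGrainedSchedulerBackend.reviveOffers -> makeOffers")
>       }
>
>       def main(args: Array[String]): Unit = {
>         executorLost("137", "foo1.com")                       // HeartbeatReceiver path
>         asyncQueue.enqueue("sc.killAndReplaceExecutor(137)")  // also async, queued late
>         asyncQueue.foreach(println)  // real handlers race; this is the logged ordering
>       }
>     }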
> Now there are a bunch of messages sent that are being processed by other 
> threads.  You can see the order in which things ran in the logs below.  In 
> this case, CoarseGrainedSchedulerBackend.makeOffers is called before 
> sc.killAndReplaceExecutor can add the executor to executorsPendingToRemove.  
> CoarseGrainedSchedulerBackend.makeOffers uses 
> CoarseGrainedSchedulerBackend.executorDataMap and executorsPendingToRemove to 
> check whether the executor exists and is alive.  At this point it is still in 
> executorDataMap and not pending removal.  Since it looks alive, makeOffers 
> calls into TaskSchedulerImpl.resourceOffers, which adds the executor back 
> into executorIdToRunningTaskIds and hostToExecutors.  Nothing after this 
> removes it from hostToExecutors again, so the 
> TaskSetManager.abortIfCompletelyBlacklisted routine always sees an executor 
> that is actually dead.
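> Another small sketch (again simplified state with assumed field names) of 
> that race window: the dead executor still passes the makeOffers liveness 
> check and gets re-registered by the resourceOffers bookkeeping:
>
>     import scala.collection.mutable
>
>     object MakeOffersRaceSketch {
>       case class WorkerOffer(executorId: String, host: String, cores: Int)
>
>       // 137 is dead but was never cleaned out of the backend's executorDataMap.
>       val executorDataMap = mutable.Map("137" -> ("foo1.com", 10))
>       // killAndReplaceExecutor has not run yet, so nothing is pending removal.
>       val executorsPendingToRemove = mutable.Set.empty[String]
>       // TaskSchedulerImpl state, already cleaned by removeExecutor above.
>       val hostToExecutors = mutable.Map.empty[String, mutable.Set[String]]
>       val executorIdToRunningTaskIds = mutable.Map.empty[String, mutable.Set[Long]]
>
>       // makeOffers-style check: in executorDataMap and not pending removal.
>       def makeOffers(): Seq[WorkerOffer] =
>         executorDataMap.collect {
>           case (id, (host, cores)) if !executorsPendingToRemove.contains(id) =>
>             WorkerOffer(id, host, cores)
>         }.toSeq
>
>       // resourceOffers-style bookkeeping: offered executors get (re)registered.
>       def resourceOffers(offers: Seq[WorkerOffer]): Unit = offers.foreach { o =>
>         executorIdToRunningTaskIds.getOrElseUpdate(o.executorId, mutable.Set.empty)
>         hostToExecutors.getOrElseUpdate(o.host, mutable.Set.empty) += o.executorId
>       }
>
>       def main(args: Array[String]): Unit = {
>         resourceOffers(makeOffers())       // wins the race in the observed ordering
>         executorsPendingToRemove += "137"  // killAndReplaceExecutor arrives too late
>         println(hostToExecutors)           // Map(foo1.com -> Set(137)): dead executor is back
>       }
>     }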
> While looking at this I also found a bug where the executor is not removed 
> from CoarseGrainedSchedulerBackend.executorDataMap either.  Normally the call 
> to CoarseGrainedSchedulerBackend.removeExecutor would remove it, but since 
> this was a heartbeat timeout that is never called.  The only way that would 
> have helped in this situation is if the executor had been removed from 
> executorDataMap before the call to CoarseGrainedSchedulerBackend.makeOffers 
> -> TaskSchedulerImpl.resourceOffers, in which case it would never have been 
> added back to executorIdToRunningTaskIds and hostToExecutors.  Note that 
> other stats like totalCoreCount and totalRegisteredExecutors will also be off 
> since removeExecutor is not called.
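> As a hedged sketch of one possible direction (not the fix actually adopted, 
> just an illustration with a hypothetical helper): if the backend also dropped 
> a heartbeat-timed-out executor from executorDataMap and adjusted the 
> counters, a later offer pass could not resurrect it:
>
>     import java.util.concurrent.atomic.AtomicInteger
>     import scala.collection.mutable
>
>     object BackendCleanupSketch {
>       // executorId -> cores, plus the aggregate counters mentioned above.
>       val executorDataMap          = mutable.Map("137" -> 10)
>       val totalCoreCount           = new AtomicInteger(10)
>       val totalRegisteredExecutors = new AtomicInteger(1)
>
>       // Hypothetical helper: drop the executor from the backend's view when it
>       // is lost, regardless of the reason (task failures, heartbeat timeout, ...).
>       def removeFromBackend(execId: String): Unit =
>         executorDataMap.remove(execId).foreach { cores =>
>           totalCoreCount.addAndGet(-cores)
>           totalRegisteredExecutors.decrementAndGet()
>         }
>
>       def main(args: Array[String]): Unit = {
>         removeFromBackend("137")
>         // With 137 gone from executorDataMap, a makeOffers-style pass would no
>         // longer offer its resources, so it could not re-enter hostToExecutors.
>         println(s"cores=${totalCoreCount.get}, executors=${totalRegisteredExecutors.get}")
>       }
>     }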
> Below are the logs that show the race.  Notice how "Requesting to kill 
> executor(s) 137" (from CoarseGrainedSchedulerBackend.killExecutors) came in 
> the middle of the removal, but it doesn't actually get as far as adding the 
> executor to executorsPendingToRemove ("Actual list of executor(s) to be 
> killed is 137") until after the executor was added back to hostToExecutors 
> ("Host added was in lost list earlier: foo1.com"):
> 2017-11-14 15:56:44,376 [task-result-getter-0] INFO  
> org.apache.spark.scheduler.TaskSetManager  - Finished task 3203.0 in stage 
> 7.0 (TID 11011) in 1396594 ms on foo3.com (executor 8) (2971/4004)
> 2017-11-14 15:56:50,729 [dispatcher-event-loop-23] WARN  
> org.apache.spark.HeartbeatReceiver  - Removing executor 137 with no recent 
> heartbeats: 639438 ms exceeds timeout 600000 ms
> 2017-11-14 15:56:50,731 [dispatcher-event-loop-23] ERROR 
> org.apache.spark.scheduler.cluster.YarnClusterScheduler  - Lost executor 137 
> on foo1.com: Executor heartbeat timed out after 639438 ms
> 2017-11-14 15:56:50,773 [dispatcher-event-loop-23] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Lost task 2696.0 in stage 7.0 
> (TID 11061, foo1.com, executor 137): ExecutorLostFailure (executor 137 exited 
> caused by one of the running tasks) Reason: Executor heartbeat timed out 
> after 639438 ms
> 2017-11-14 15:56:50,778 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - Executor lost: 137 (epoch 0)
> 2017-11-14 15:56:50,779 [dispatcher-event-loop-20] INFO  
> org.apache.spark.storage.BlockManagerMasterEndpoint  - Trying to remove 
> executor 137 from BlockManagerMaster.
> 2017-11-14 15:56:50,779 [kill-executor-thread] INFO  
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend  - Requesting 
> to kill executor(s) 137
> 2017-11-14 15:56:50,780 [dispatcher-event-loop-20] INFO  
> org.apache.spark.storage.BlockManagerMasterEndpoint  - Removing block manager 
> BlockManagerId(137, foo1.com, 42569, None)
> 2017-11-14 15:56:50,780 [dag-scheduler-event-loop] INFO  
> org.apache.spark.storage.BlockManagerMaster  - Removed 137 successfully in 
> removeExecutor
> 2017-11-14 15:56:50,781 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - Host added was in lost list 
> earlier: foo1.com
> 2017-11-14 15:56:50,782 [kill-executor-thread] INFO  
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend  - Actual list 
> of executor(s) to be killed is 137
> 2017-11-14 15:56:50,827 [dispatcher-event-loop-6] INFO  
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint  - Driver requested 
> to kill executor(s) 137.
> 2017-11-14 15:56:51,616 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator  - Will request 1 executor 
> container(s), each with 10 core(s) and 15360 MB memory (including 8192 MB of 
> overhead)
> Note there may be another bug here with speculation as well, because some 
> time later I see this same host in a log message that makes it sound like the 
> scheduler thinks the task is still running on that dead host:
> 2017-11-14 16:46:40,970 [task-scheduler-speculation] INFO  
> org.apache.spark.scheduler.TaskSetManager  - Marking task 2696 in stage 7.0 
> (on gsbl560n05.blue.ygrid.yahoo.com) as speculatable because it ran more than 
> 2254168 ms
> We can split this off into another jira if we find it's actually a bug.  It 
> looks like the issue there is that TaskSetManager.copiesRunning is not 
> decremented because we are using the external shuffle service, but this needs 
> further investigation.
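> Purely as an illustration of that hypothesis (simplified arrays and a made-up 
> threshold; not a claim about what the speculation code actually does), a 
> stale copiesRunning entry is enough to keep a lost task looking speculatable:
>
>     object SpeculationSketch {
>       def main(args: Array[String]): Unit = {
>         // Index 0 stands for task 2696; the failure path never decremented this.
>         val copiesRunning = Array(1)
>         val successful    = Array(false)
>         val runtimeMs     = 2254168L
>         val thresholdMs   = 100000L   // stand-in for the median-based threshold
>
>         // A task with copies "running" that is not yet successful and has run
>         // far longer than the threshold keeps getting flagged as speculatable.
>         val speculatable = !successful(0) && copiesRunning(0) > 0 && runtimeMs > thresholdMs
>         println(s"marking task as speculatable: $speculatable")   // true
>       }
>     }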


