[ https://issues.apache.org/jira/browse/SPARK-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268792#comment-16268792 ]
Thomas Graves commented on SPARK-22620:
---------------------------------------

Note this was running Spark on YARN in cluster mode with dynamic allocation disabled. In this case, all of the tasks started failing after a while because an external service they were trying to connect to was down.

> Deadlock in blacklisting when executor is removed due to a heartbeat timeout
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-22620
>                 URL: https://issues.apache.org/jira/browse/SPARK-22620
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.2.0
>            Reporter: Thomas Graves
>            Priority: Critical
>
> I ran into a deadlock with blacklisting where the application was stuck in a state where it thought it had executors to schedule on, but it really didn't because those executors were dead. So all executors were blacklisted and it couldn't run any tasks.
> Note for this to happen the application has to be in a state where many tasks are going to fail (at least enough tasks fail to cause all executors to be blacklisted) and an executor has to get a heartbeat timeout.
> Small disclaimer: I ran into this on Spark 2.1 with the blacklisting feature backported, but looking at the code for the master branch I don't see it fixed there either.
> Normally the code in TaskSetManager.abortIfCompletelyBlacklisted prevents the app from becoming deadlocked when no executors are available. Unfortunately it did not take effect here, because a dead executor was still in TaskSchedulerImpl.hostToExecutors and its host wasn't blacklisted (the failure was a heartbeat timeout, not task failures on that host), so the scheduler thought it still had executors to run on.
> Now, how it got into this situation:
> The HeartbeatReceiver timed out the executor based on no heartbeat. This called TaskSchedulerImpl.executorLost, which called TaskSchedulerImpl.removeExecutor; that removed the executor from executorIdToRunningTaskIds and hostToExecutors. It does other handling there and sends a CompletionEvent via DAGScheduler.taskEnded.
> TaskSchedulerImpl.removeExecutor returns, and TaskSchedulerImpl.executorLost calls into DAGScheduler.executorLost and then CoarseGrainedSchedulerBackend.reviveOffers, both of which send messages asynchronously. This returns, and HeartbeatReceiver then calls sc.killAndReplaceExecutor asynchronously.
> Now there are a bunch of messages sent which are being processed by other threads; you can see the order in which things ran from the logs below. In this situation, CoarseGrainedSchedulerBackend.makeOffers is called before sc.killAndReplaceExecutor can add the executor to executorsPendingToRemove. CoarseGrainedSchedulerBackend.makeOffers uses CoarseGrainedSchedulerBackend.executorDataMap and executorsPendingToRemove to check that the executor exists and is alive. At this point it is still in executorDataMap and not pending removal, so makeOffers calls into TaskSchedulerImpl.resourceOffers, which adds the executor back into executorIdToRunningTaskIds and hostToExecutors. Nothing after this removes it from hostToExecutors, so the TaskSetManager.abortIfCompletelyBlacklisted routine always sees an executor that is actually dead.
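The interleaving is easier to see in code. Below is a minimal, self-contained Scala sketch of the race, not the actual Spark source: the names mirror the internals discussed above, but the bodies are stripped down to the few map updates that matter, and the single-threaded main() just replays the order of events seen in the logs (executorIdToRunningTaskIds is elided for brevity).

    object BlacklistDeadlockSketch {
      import scala.collection.mutable

      // Driver-side state, as in CoarseGrainedSchedulerBackend / TaskSchedulerImpl.
      val executorDataMap          = mutable.Map("137" -> "foo1.com")
      val executorsPendingToRemove = mutable.Set.empty[String]
      val hostToExecutors          = mutable.Map("foo1.com" -> mutable.Set("137"))

      // Heartbeat-timeout path: drops the executor from the scheduler's view.
      // Note it never touches executorDataMap (the second bug, discussed below).
      def removeExecutor(id: String): Unit =
        hostToExecutors.values.foreach(_ -= id)

      // Offers resources on every executor that is in executorDataMap and not
      // pending removal -- the stale "alive" check at the heart of the race.
      def makeOffers(): Unit =
        for ((id, host) <- executorDataMap if !executorsPendingToRemove(id))
          // resourceOffers re-registers any executor it is offered:
          hostToExecutors.getOrElseUpdate(host, mutable.Set.empty[String]) += id

      def killAndReplaceExecutor(id: String): Unit =
        executorsPendingToRemove += id

      def main(args: Array[String]): Unit = {
        removeExecutor("137")         // 15:56:50,731  executor 137 lost
        makeOffers()                  // reviveOffers is processed first...
        killAndReplaceExecutor("137") // ...15:56:50,782 "Actual list" arrives too late
        // Dead executor 137 is back, so abortIfCompletelyBlacklisted always
        // sees a seemingly live executor and never aborts the task set:
        println(hostToExecutors)      // foo1.com -> Set(137)
      }
    }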
> While looking at this I also found a bug where the executor is not removed from CoarseGrainedSchedulerBackend.executorDataMap either. Normally the call to CoarseGrainedSchedulerBackend.removeExecutor would remove it, but since this was a heartbeat timeout that is never called. The only way that would have helped in this situation is if the executor had been removed from executorDataMap before the call to CoarseGrainedSchedulerBackend.makeOffers -> TaskSchedulerImpl.resourceOffers, in which case it would never have been added back to executorIdToRunningTaskIds and hostToExecutors. Note other stats like totalCoreCount and totalRegisteredExecutors will also be off since removeExecutor is not called.
> Below are the logs that show the race. Notice how "Requesting to kill executor(s) 137" (logged from CoarseGrainedSchedulerBackend.killExecutors) came in the middle of the removal, but it doesn't actually get as far as adding the executor to executorsPendingToRemove ("Actual list of executor(s) to be killed is 137") until after the executor was added back to hostToExecutors ("Host added was in lost list earlier: foo1.com").
> 2017-11-14 15:56:44,376 [task-result-getter-0] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 3203.0 in stage 7.0 (TID 11011) in 1396594 ms on foo3.com (executor 8) (2971/4004)
> 2017-11-14 15:56:50,729 [dispatcher-event-loop-23] WARN org.apache.spark.HeartbeatReceiver - Removing executor 137 with no recent heartbeats: 639438 ms exceeds timeout 600000 ms
> 2017-11-14 15:56:50,731 [dispatcher-event-loop-23] ERROR org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost executor 137 on foo1.com: Executor heartbeat timed out after 639438 ms
> 2017-11-14 15:56:50,773 [dispatcher-event-loop-23] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 2696.0 in stage 7.0 (TID 11061, foo1.com, executor 137): ExecutorLostFailure (executor 137 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 639438 ms
> 2017-11-14 15:56:50,778 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Executor lost: 137 (epoch 0)
> 2017-11-14 15:56:50,779 [dispatcher-event-loop-20] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 137 from BlockManagerMaster.
> 2017-11-14 15:56:50,779 [kill-executor-thread] INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Requesting to kill executor(s) 137
> 2017-11-14 15:56:50,780 [dispatcher-event-loop-20] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager BlockManagerId(137, foo1.com, 42569, None)
> 2017-11-14 15:56:50,780 [dag-scheduler-event-loop] INFO org.apache.spark.storage.BlockManagerMaster - Removed 137 successfully in removeExecutor
> 2017-11-14 15:56:50,781 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Host added was in lost list earlier: foo1.com
> 2017-11-14 15:56:50,782 [kill-executor-thread] INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Actual list of executor(s) to be killed is 137
> 2017-11-14 15:56:50,827 [dispatcher-event-loop-6] INFO org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver requested to kill executor(s) 137.
> 2017-11-14 15:56:51,616 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor container(s), each with 10 core(s) and 15360 MB memory (including 8192 MB of overhead)
> Note there may be another bug here with speculation as well, because some time later I see this same host in a log message that makes it sound like the task is still thought to be running on that dead host:
> 2017-11-14 16:46:40,970 [task-scheduler-speculation] INFO org.apache.spark.scheduler.TaskSetManager - Marking task 2696 in stage 7.0 (on gsbl560n05.blue.ygrid.yahoo.com) as speculatable because it ran more than 2254168 ms
> We can split this off into another jira if we find it's actually a bug. It looks like the issue there is that TaskSetManager.copiesRunning is not decremented because we are using the external shuffle service, but it needs further investigation.
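If it does turn out to be real, the suspected mechanism is easy to state in code. The sketch below is hypothetical and simplified (TaskInfo here is not Spark's class, and the real scan lives in TaskSetManager): it only shows that a task whose sole copy ran on the dead executor keeps qualifying for speculation if copiesRunning is never decremented when the executor is lost.

    object SpeculationSketch {
      // Simplified stand-in for TaskSetManager's per-task bookkeeping.
      case class TaskInfo(index: Int, host: String, startMs: Long, copiesRunning: Int)

      // Shape of the speculation scan: any task still counted as running for
      // longer than the threshold is marked speculatable -- including one
      // whose only copy sits on a dead host.
      def speculatable(tasks: Seq[TaskInfo], nowMs: Long, thresholdMs: Long): Seq[TaskInfo] =
        tasks.filter(t => t.copiesRunning > 0 && nowMs - t.startMs > thresholdMs)

      def main(args: Array[String]): Unit = {
        // Task 2696's only copy was on the lost executor; copiesRunning is
        // still 1 because the loss was never accounted for.
        val stale = TaskInfo(index = 2696, host = "foo1.com", startMs = 0L, copiesRunning = 1)
        speculatable(Seq(stale), nowMs = 2254169L, thresholdMs = 1000000L)
          .foreach(t => println(s"Marking task ${t.index} (on ${t.host}) as speculatable"))
      }
    }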