GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/15986
[SPARK-18553][CORE][branch-2.0] Fix leak of TaskSetManager following executor loss
## What changes were proposed in this pull request?
This patch fixes a critical resource leak in the TaskScheduler which could
cause RDDs and ShuffleDependencies to be kept alive indefinitely if an executor
with running tasks is permanently lost and the associated stage fails.
This problem was originally identified by analyzing the heap dump of a
driver belonging to a cluster that had run out of shuffle space. This dump
contained several `ShuffleDependency` instances that were retained by
`TaskSetManager`s inside the scheduler but were not otherwise referenced. Each
of these `TaskSetManager`s was considered a "zombie" but had no running tasks
and therefore should have been cleaned up. However, these zombie task sets were
still referenced by the `TaskSchedulerImpl.taskIdToTaskSetManager` map.
Entries are added to the `taskIdToTaskSetManager` map when tasks are
launched and are removed inside of `TaskScheduler.statusUpdate()`, which is
invoked by the scheduler backend while processing `StatusUpdate` messages from
executors. The problem with this design is that a completely dead executor will
never send a `StatusUpdate`. There is [some
code](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L338)
in `statusUpdate` which handles tasks that exit with the `TaskState.LOST`
state (which is supposed to correspond to a task failure triggered by total
executor loss), but this state only seems to be used in Mesos fine-grained
mode. There doesn't seem to be any code which performs per-task state cleanup
for tasks that were running on an executor that completely disappears without
sending any sort of final death message. The `executorLost` and
[`removeExecutor`](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L527)
methods don't appear to perform any cleanup of the `taskId -> *` mappings,
causing the leaks observed here.
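To make the leak mechanism concrete, here is a hypothetical miniature of the pre-fix bookkeeping (simplified names, not the actual Spark code): the only removal path for `taskIdToTaskSetManager` entries runs inside `statusUpdate`, which requires a message that a dead executor will never send.

```scala
import scala.collection.mutable

object LeakSketch {
  // Miniature of the pre-fix bookkeeping: the only map removal path
  // is statusUpdate(), driven by messages from the executor itself.
  val taskIdToTaskSetManager = mutable.Map[Long, String]()

  def taskLaunched(taskId: Long, tsm: String): Unit =
    taskIdToTaskSetManager(taskId) = tsm

  // Invoked only while processing a StatusUpdate message from an executor.
  def statusUpdate(taskId: Long): Unit =
    taskIdToTaskSetManager.remove(taskId)

  def main(args: Array[String]): Unit = {
    taskLaunched(1L, "tsm-stage-0")
    // Executor dies silently: statusUpdate(1L) is never invoked, so the
    // entry, and everything the TaskSetManager transitively references
    // (RDDs, ShuffleDependencies), is retained indefinitely.
    assert(taskIdToTaskSetManager.contains(1L))
  }
}
```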
This patch's fix is to maintain an `executorId -> running task ids` mapping
so that these `taskId -> *` maps can be properly cleaned up following an
executor loss. I also fixed a minor [thread-safety
bug](https://github.com/apache/spark/pull/11888#issuecomment-262423508)
introduced in #11888.
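A minimal sketch of the fix's bookkeeping (hypothetical names, not the patch itself): a reverse `executorId -> running task ids` index lets executor-loss handling purge every per-task entry without waiting for a `StatusUpdate` that will never arrive.

```scala
import scala.collection.mutable

class BookkeepingSketch {
  val taskIdToTaskSetManager = mutable.Map[Long, String]()
  val taskIdToExecutorId = mutable.Map[Long, String]()
  // The addition: a reverse index from each executor to its running tasks.
  private val executorIdToRunningTaskIds =
    mutable.Map[String, mutable.Set[Long]]()

  def taskLaunched(taskId: Long, execId: String, tsm: String): Unit = {
    taskIdToTaskSetManager(taskId) = tsm
    taskIdToExecutorId(taskId) = execId
    executorIdToRunningTaskIds.getOrElseUpdate(execId, mutable.Set.empty) += taskId
  }

  // Normal path: the executor reports the task's terminal state.
  def statusUpdate(taskId: Long): Unit = {
    taskIdToTaskSetManager.remove(taskId)
    taskIdToExecutorId.remove(taskId).foreach { execId =>
      executorIdToRunningTaskIds.get(execId).foreach(_ -= taskId)
    }
  }

  // New path: an executor vanishes without any final message. Drop every
  // taskId -> * entry for its running tasks so nothing is retained.
  def executorLost(execId: String): Unit =
    executorIdToRunningTaskIds.remove(execId).foreach { taskIds =>
      taskIds.foreach { tid =>
        taskIdToTaskSetManager.remove(tid)
        taskIdToExecutorId.remove(tid)
      }
    }
}
```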
There are some potential corner-case interactions that I'm concerned about
here, especially some details in [the
comment](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L523)
in `removeExecutor`, so I'd appreciate a very careful review of these changes.
This PR is opened against branch-2.0, where I first observed this problem,
but will also need to be fixed in master, branch-2.1, and branch-1.6 (which
I'll do in followup PRs after this fix is reviewed and merged).
## How was this patch tested?
I added a new unit test to `TaskSchedulerImplSuite`. You can check out this
PR as of 25e455e711b978cd331ee0f484f70fde31307634 to see the failing test.
cc @kayousterhout, @markhamstra, @rxin, and @sitalkedia for review.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark fix-leak-following-total-executor-loss
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15986.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15986
----
commit 25e455e711b978cd331ee0f484f70fde31307634
Author: Josh Rosen <[email protected]>
Date: 2016-11-23T02:08:07Z
Add failing regression test.
commit 69feae3591adc9fe88aff8c190d0d95f14cb0ced
Author: Josh Rosen <[email protected]>
Date: 2016-11-23T03:06:02Z
Initial fix.
----