GitHub user kayousterhout opened a pull request:

    https://github.com/apache/spark/pull/305

    Properly clean up DAGScheduler on job cancellation.

    Previously, when jobs were cancelled, not all of the state in the
    DAGScheduler was cleaned up, leading to a slow memory leak in the
    DAGScheduler.  As we expose easier ways to cancel jobs, it's more
    important to fix these issues.
    
    This commit also fixes a second, less serious problem: previously,
    when a stage failed, not all of the appropriate stages were
    cancelled.  See the "failure of stage used by two jobs" test for an
    example of this.  This only meant that extra work was done; it was
    not a correctness problem.
    
    This commit adds 3 tests.  "run shuffle with map stage failure" is
    a new test to more thoroughly test this functionality, and passes on
    both the old and new versions of the code.  "trivial job
    cancellation" fails on the old code because all state wasn't cleaned
    up correctly when jobs were cancelled (we didn't remove the job from
    resultStageToJob).  "failure of stage used by two jobs" fails on the
    old code because taskScheduler.cancelTasks wasn't called for one of
    the stages (see test comments).
    
    This should be checked in before #246, which makes it easier to
    cancel stages / jobs.
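The gist of the fix, as described above, is that cancelling a job must remove every piece of the job's bookkeeping state, including the resultStageToJob entries that previously leaked, and must call taskScheduler.cancelTasks for each affected stage. Below is a minimal, hypothetical sketch of that pattern; only the names resultStageToJob and cancelTasks come from the PR description, and the simplified Stage/Job/MiniScheduler classes are illustrative stand-ins, not Spark's actual DAGScheduler.

```scala
import scala.collection.mutable

// Illustrative stand-ins for Spark's internal types (not the real classes).
case class Stage(id: Int)
case class Job(id: Int, stages: Seq[Stage])

// A toy scheduler modeling the leak described in the PR: job state lives in
// several mutable maps/sets, and cancellation must clear all of them.
class MiniScheduler {
  val resultStageToJob = mutable.Map.empty[Stage, Job]
  val runningStages = mutable.Set.empty[Stage]
  val cancelledStageIds = mutable.ArrayBuffer.empty[Int]

  def submit(job: Job): Unit =
    for (s <- job.stages) {
      resultStageToJob(s) = job
      runningStages += s
    }

  // Stand-in for taskScheduler.cancelTasks; here it just records the call.
  def cancelTasks(stageId: Int): Unit =
    cancelledStageIds += stageId

  // The pattern the PR fixes: remove *all* state for the job, including
  // the resultStageToJob entries that were previously left behind.
  def cancelJob(job: Job): Unit =
    for (s <- job.stages) {
      cancelTasks(s.id)
      runningStages -= s
      resultStageToJob -= s // forgetting this line is the slow leak
    }
}
```

If any map is skipped during cancellation, entries for cancelled jobs accumulate for the lifetime of the scheduler, which is the slow memory leak the description refers to.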

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kayousterhout/spark-1 incremental_abort_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/305.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #305
    
----
commit 33f472d983ebfd8c0b7a99adb1d62ed2df4275bb
Author: Kay Ousterhout <[email protected]>
Date:   2014-04-02T18:14:53Z

    Properly clean up DAGScheduler on job cancellation.
    
    Previously, when jobs were cancelled, not all of the state in the
    DAGScheduler was cleaned up, leading to a slow memory leak in the
    DAGScheduler.  As we expose easier ways to cancel jobs, it's more
    important to fix these issues.
    
    This commit adds 3 tests.  "run shuffle with map stage failure" is
    a new test to more thoroughly test this functionality, and passes on
    both the old and new versions of the code.  "trivial job
    cancellation" fails on the old code because all state wasn't cleaned
    up correctly when jobs were cancelled (we didn't remove the job from
    resultStageToJob).  "failure of stage used by two jobs" fails on the
    old code because taskScheduler.cancelTasks wasn't called for one of
    the stages (see test comments).

----

