GitHub user kayousterhout opened a pull request:
https://github.com/apache/spark/pull/305
Properly cleanup DAGScheduler on job cancellation.
Previously, when jobs were cancelled, not all of the state in the
DAGScheduler was cleaned up, leading to a slow memory leak in the
DAGScheduler. As we expose easier ways to cancel jobs, it's more
important to fix these issues.
This commit also fixes a second, less serious problem: previously,
when a stage failed, not all of the appropriate stages were cancelled.
See the "failure of stage used by two jobs" test for an example of
this. The only consequence was that extra work was done; it was not a
correctness problem.
This commit adds 3 tests. "run shuffle with map stage failure" is
a new test to more thoroughly test this functionality, and passes on
both the old and new versions of the code. "trivial job
cancellation" fails on the old code because all state wasn't cleaned
up correctly when jobs were cancelled (we didn't remove the job from
resultStageToJob). "failure of stage used by two jobs" fails on the
old code because taskScheduler.cancelTasks wasn't called for one of
the stages (see test comments).
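
To make the two failure modes above concrete, here is a minimal toy model
in Python. It is not Spark's DAGScheduler code: the class ToyScheduler, its
methods, and every field except the name result_stage_to_job (borrowed from
resultStageToJob above) are hypothetical, and the cancelled_stages list only
stands in for calls to taskScheduler.cancelTasks. It assumes that cancelling
a job should drop every map entry for that job and should cancel only the
stages that no other job still needs.

```python
class ToyScheduler:
    """Toy model of per-job bookkeeping; NOT Spark's DAGScheduler."""

    def __init__(self):
        self.job_to_stages = {}        # job id -> set of stage ids
        self.stage_to_jobs = {}        # stage id -> set of job ids
        self.result_stage_to_job = {}  # result stage id -> job id
        self.cancelled_stages = []     # stand-in for taskScheduler.cancelTasks

    def submit_job(self, job_id, stage_ids, result_stage):
        self.job_to_stages[job_id] = set(stage_ids)
        for stage_id in stage_ids:
            self.stage_to_jobs.setdefault(stage_id, set()).add(job_id)
        self.result_stage_to_job[result_stage] = job_id

    def cancel_job(self, job_id):
        # Forgetting any of these removals leaves an entry behind forever,
        # which is the kind of slow leak the description refers to.
        self.result_stage_to_job = {
            stage: job
            for stage, job in self.result_stage_to_job.items()
            if job != job_id
        }
        for stage_id in sorted(self.job_to_stages.pop(job_id, set())):
            jobs = self.stage_to_jobs[stage_id]
            jobs.discard(job_id)
            if not jobs:
                # No remaining job needs this stage, so cancel its tasks.
                del self.stage_to_jobs[stage_id]
                self.cancelled_stages.append(stage_id)

    def fail_stage(self, stage_id):
        # A failed stage aborts every job that depends on it.
        for job_id in sorted(self.stage_to_jobs.get(stage_id, set())):
            self.cancel_job(job_id)

    def is_clean(self):
        return not (self.job_to_stages
                    or self.stage_to_jobs
                    or self.result_stage_to_job)


sched = ToyScheduler()
sched.submit_job(1, stage_ids=[0, 1], result_stage=1)
sched.submit_job(2, stage_ids=[0, 2], result_stage=2)
sched.cancel_job(1)   # stage 0 is still needed by job 2
sched.fail_stage(0)   # aborts job 2 and cancels its remaining stages
```

In this sketch, a test in the spirit of "trivial job cancellation" would
check is_clean() after a cancellation, and one in the spirit of "failure of
stage used by two jobs" would check that every stage of the affected jobs
shows up in cancelled_stages.
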
This should be checked in before #246, which makes it easier to
cancel stages / jobs.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kayousterhout/spark-1 incremental_abort_fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/305.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #305
----
commit 33f472d983ebfd8c0b7a99adb1d62ed2df4275bb
Author: Kay Ousterhout <[email protected]>
Date: 2014-04-02T18:14:53Z
Properly cleanup DAGScheduler on job cancellation.
Previously, when jobs were cancelled, not all of the state in the
DAGScheduler was cleaned up, leading to a slow memory leak in the
DAGScheduler. As we expose easier ways to cancel jobs, it's more
important to fix these issues.
This commit adds 3 tests. "run shuffle with map stage failure" is
a new test to more thoroughly test this functionality, and passes on
both the old and new versions of the code. "trivial job
cancellation" fails on the old code because all state wasn't cleaned
up correctly when jobs were cancelled (we didn't remove the job from
resultStageToJob). "failure of stage used by two jobs" fails on the
old code because taskScheduler.cancelTasks wasn't called for one of
the stages (see test comments).
----