GitHub user kayousterhout opened a pull request:
https://github.com/apache/spark/pull/309
[SPARK-1397] [WIP] Notify SparkListeners when stages fail or are cancelled.
[I wanted to post this for folks to comment but it depends on (and thus
includes the changes in) a currently outstanding PR, #305. You can look at
just the second commit:
https://github.com/kayousterhout/spark-1/commit/93f08baf731b9eaf5c9792a5373560526e2bccac
to see just the changes relevant to this PR]
Previously, when stages fail or get cancelled, the SparkListener is only notified indirectly through the SparkListenerJobEnd event, where we sometimes pass in a single stage that failed. This worked before job cancellation, because jobs would only fail due to a single stage failure. However, with job cancellation, multiple running stages can fail when a job gets cancelled. Right now, this is not handled correctly, which results in stages that get stuck in the "Running Stages" window in the UI even though they're dead.
This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded event, and uses this event to tell SparkListeners when stages fail in addition to when they complete successfully. This change is NOT publicly backward compatible, for two reasons. First, it changes the SparkListener interface. We could alternately add a new event, SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted. However, this is less consistent with the listener events for tasks / jobs ending, and will result in some code duplication for listeners (because failed and completed stages are handled in similar ways). Note that I haven't finished updating the JSON code to correctly handle the new event because I'm waiting for feedback on whether this is a good or bad idea (hence the "WIP").
It is also not backwards compatible because it changes the publicly visible JobWaiter.jobFailed() method to no longer include a stage that caused the failure. I think this change should definitely stay, because with cancellation (as described above), a failure isn't necessarily caused by a single stage.
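As a rough illustration of the proposed shape (a single stage-ended event carrying a success-or-failure reason, so listeners handle both cases in one callback instead of duplicating logic across separate Completed/Failed events), here is a hedged, self-contained sketch. The names mirror the PR's proposal but are hypothetical stand-ins, not Spark's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class StageEndedSketch {
    // Stand-in for the proposed unified event: one event type for both
    // successful completion and failure/cancellation.
    record StageEnded(int stageId, boolean succeeded, String failureMessage) {}

    // Stand-in for the single listener callback the PR argues for.
    interface StageListener {
        void onStageEnded(StageEnded event);
    }

    // Produces one human-readable line per stage-ended event; success and
    // failure share the same code path, avoiding listener duplication.
    static List<String> describe(List<StageEnded> events) {
        List<String> out = new ArrayList<>();
        StageListener listener = e ->
            out.add(e.succeeded()
                ? "stage " + e.stageId() + " completed"
                : "stage " + e.stageId() + " failed: " + e.failureMessage());
        events.forEach(listener::onStageEnded);
        return out;
    }

    public static void main(String[] args) {
        // With cancellation, several running stages can end unsuccessfully
        // at once, all reported through the same event type.
        List<StageEnded> events = List.of(
            new StageEnded(1, true, null),
            new StageEnded(2, false, "job cancelled"),
            new StageEnded(3, false, "job cancelled"));
        describe(events).forEach(System.out::println);
    }
}
```

The alternative design (a separate SparkListenerStageFailed event) would force every listener to implement two callbacks that differ only in how the reason is reported.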
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kayousterhout/spark-1 stage_cancellation
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/309.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #309
----
commit 6aae3b119dd259330f9530a27af84c9575967132
Author: Kay Ousterhout <[email protected]>
Date: 2014-04-02T18:14:53Z
Properly cleanup DAGScheduler on job cancellation.
Previously, when jobs were cancelled, not all of the state in the DAGScheduler was cleaned up, leading to a slow memory leak in the DAGScheduler. As we expose easier ways to cancel jobs, it's more important to fix these issues.
This commit adds 3 tests. "run shuffle with map stage failure" is a new test to more thoroughly test this functionality, and passes on both the old and new versions of the code. "trivial job cancellation" fails on the old code because all state wasn't cleaned up correctly when jobs were cancelled (we didn't remove the job from resultStageToJob). "failure of stage used by two jobs" fails on the old code because taskScheduler.cancelTasks wasn't called for one of the stages (see test comments).
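The leak pattern the commit fixes can be sketched in a few lines: if cancelling a job does not remove its entries from the scheduler's bookkeeping maps (such as resultStageToJob), those entries accumulate for the lifetime of the scheduler. This is a hedged, simplified illustration with hypothetical names, not the DAGScheduler's actual fields:

```java
import java.util.HashMap;
import java.util.Map;

public class CancelCleanupSketch {
    // Illustrative stand-in for a scheduler bookkeeping map:
    // result stage id -> owning job id.
    final Map<Integer, Integer> resultStageToJob = new HashMap<>();

    void submitJob(int jobId, int resultStageId) {
        resultStageToJob.put(resultStageId, jobId);
    }

    // Correct cleanup: cancellation must drop every entry owned by the
    // cancelled job, or the map grows without bound as jobs come and go.
    void cancelJob(int jobId) {
        resultStageToJob.values().removeIf(j -> j == jobId);
    }
}
```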
commit 93f08baf731b9eaf5c9792a5373560526e2bccac
Author: Kay Ousterhout <[email protected]>
Date: 2014-04-02T22:54:29Z
Notify SparkListeners when stages fail or are cancelled.
Previously, when stages fail or get cancelled, the SparkListener is only notified indirectly through the SparkListenerJobEnd event, where we sometimes pass in a single stage that failed. This worked before job cancellation, because jobs would only fail due to a single stage failure. However, with job cancellation, multiple running stages can fail when a job gets cancelled. Right now, this is not handled correctly, which results in stages that get stuck in the "Running Stages" window in the UI even though they're dead.
This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded event, and uses this event to tell SparkListeners when stages fail in addition to when they complete successfully. This change is NOT publicly backward compatible, for two reasons. First, it changes the SparkListener interface. We could alternately add a new event, SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted. However, this is less consistent with the listener events for tasks / jobs ending, and will result in some code duplication for listeners (because failed and completed stages are handled in similar ways). Note that I haven't finished updating the JSON code to correctly handle the new event because I'm waiting for feedback on whether this is a good or bad idea (hence the "WIP").
It is also not backwards compatible because it changes the publicly visible JobWaiter.jobFailed() method to no longer include a stage that caused the failure. I think this change should definitely stay, because with cancellation (as described above), a failure isn't necessarily caused by a single stage.
----