Merge pull request #190 from markhamstra/Stages4Jobs

stageId <--> jobId mapping in DAGScheduler

Okay, I think this one is ready to go -- or at least it's ready for review and 
discussion.  It's a carry-over of https://github.com/mesos/spark/pull/842 with 
updates for the newer job cancellation functionality.  The prior discussion 
still applies.  I've actually changed the job cancellation flow a bit: Instead 
of ``cancelTasks`` going to the TaskScheduler and then ``taskSetFailed`` coming 
back to the DAGScheduler (resulting in ``abortStage`` there), the DAGScheduler 
now takes care of figuring out which stages should be cancelled, tells the 
TaskScheduler to cancel tasks for those stages, then does the cleanup within 
the DAGScheduler directly without the need for any further prompting by the 
TaskScheduler.
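
The flow described above can be sketched roughly as follows.  This is a 
simplified model, not the code in this PR: ``JobStageRegistry``, 
``TaskCanceller``, ``registerStage`` and ``isTracked`` are hypothetical names, 
and the rule that a stage is only cancelled once no other job still needs it is 
an assumption made for illustration.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the TaskScheduler side of the interaction.
trait TaskCanceller {
  def cancelTasks(stageId: Int): Unit
}

// Simplified model (not Spark's DAGScheduler): it tracks only the
// job <--> stage relationship and the cancellation path.
class JobStageRegistry(canceller: TaskCanceller) {
  private val jobIdToStageIds = mutable.HashMap.empty[Int, mutable.HashSet[Int]]
  private val stageIdToJobIds = mutable.HashMap.empty[Int, mutable.HashSet[Int]]

  // Record that jobId depends on stageId, in both directions.
  def registerStage(jobId: Int, stageId: Int): Unit = {
    jobIdToStageIds.getOrElseUpdate(jobId, mutable.HashSet.empty[Int]) += stageId
    stageIdToJobIds.getOrElseUpdate(stageId, mutable.HashSet.empty[Int]) += jobId
  }

  def stagesFor(jobId: Int): Set[Int] =
    jobIdToStageIds.get(jobId).map(_.toSet).getOrElse(Set.empty[Int])

  def isTracked(stageId: Int): Boolean = stageIdToJobIds.contains(stageId)

  // Decide which stages to cancel, tell the canceller, then clean up
  // locally without waiting for any callback.  Returns the cancelled stages.
  def cancelJob(jobId: Int): Set[Int] = {
    val stages = jobIdToStageIds.remove(jobId).map(_.toSet).getOrElse(Set.empty[Int])
    val cancelled = Set.newBuilder[Int]
    for (stageId <- stages) {
      val jobs = stageIdToJobIds(stageId)
      jobs -= jobId
      // Assumption: a stage still needed by another job is left running.
      if (jobs.isEmpty) {
        canceller.cancelTasks(stageId)
        stageIdToJobIds.remove(stageId)
        cancelled += stageId
      }
    }
    cancelled.result()
  }
}
```

In this sketch, if jobs 0 and 1 both depend on stage 2 and job 0 alone depends 
on stage 1, then ``cancelJob(0)`` cancels tasks only for stage 1 and leaves 
stage 2 registered for job 1.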

I know of three outstanding issues, each of which can and should, I believe, be 
handled in follow-up pull requests:

1) https://spark-project.atlassian.net/browse/SPARK-960
2) JobLogger should be refactored to eliminate duplication
3) Related to 2), the WebUI should also become a consumer of the DAGScheduler's 
new understanding of the relationship between jobs and stages so that it can 
display progress indication and the like grouped by job.  Right now, some of 
this information is just being sent out as part of ``SparkListenerJobStart`` 
messages, but more or different job <--> stage information may need to be 
exported from the DAGScheduler to meet listeners' needs.
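
As a rough illustration of item 3), a listener could group stage progress by 
job along these lines.  The event types here (``JobStart``, 
``StageCompleted``) and ``JobProgressTracker`` are hypothetical, simplified 
stand-ins and not the actual ``SparkListener`` classes; the sketch only assumes 
that a job-start message carries the ids of the job's stages.

```scala
import scala.collection.mutable

// Hypothetical, simplified events; not Spark's SparkListener classes.
sealed trait SchedulerEvent
case class JobStart(jobId: Int, stageIds: Seq[Int]) extends SchedulerEvent
case class StageCompleted(stageId: Int) extends SchedulerEvent

// Keeps per-job stage sets so that progress can be reported grouped by job.
class JobProgressTracker {
  private val stagesByJob = mutable.LinkedHashMap.empty[Int, Set[Int]]
  private val completedStages = mutable.HashSet.empty[Int]

  def onEvent(event: SchedulerEvent): Unit = event match {
    case JobStart(jobId, stageIds) => stagesByJob(jobId) = stageIds.toSet
    case StageCompleted(stageId)   => completedStages += stageId
  }

  // (completed, total) stage counts for one job, if the job is known.
  def progress(jobId: Int): Option[(Int, Int)] =
    stagesByJob.get(jobId).map { stages =>
      (stages.count(completedStages.contains), stages.size)
    }
}
```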

Except for the eventQueue -> Actor commit, the rest can be cherry-picked almost 
cleanly into branch-0.8.  A little merging is needed in MapOutputTracker and 
the DAGScheduler.  Merged versions of those files are in 
https://github.com/markhamstra/incubator-spark/tree/aba2b40ce04ee9b7b9ea260abb6f09e050142d43

Note that between the recent Actor change in the DAGScheduler and the cleaning 
up of DAGScheduler data structures on job completion in this PR, some races 
have been introduced into the DAGSchedulerSuite.  Those tests usually pass, and 
I don't think that better-behaved code that doesn't directly inspect 
DAGScheduler data structures should be seeing any problems, but I'll work on 
fixing DAGSchedulerSuite as either an addition to this PR or as a separate 
request.

UPDATE: Fixed the race that I introduced.  Created a JIRA issue (SPARK-965) for 
the one that was introduced with the switch to eventProcessorActor in the 
DAGScheduler.


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/e0392343
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/e0392343
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/e0392343

Branch: refs/heads/scala-2.10
Commit: e0392343a026d632bac0df0ad4f399fce742c151
Parents: bfa6860 403234d
Author: Matei Zaharia <[email protected]>
Authored: Fri Dec 6 11:49:59 2013 -0800
Committer: Matei Zaharia <[email protected]>
Committed: Fri Dec 6 11:49:59 2013 -0800

----------------------------------------------------------------------
 .../org/apache/spark/MapOutputTracker.scala     |   8 +-
 .../apache/spark/scheduler/DAGScheduler.scala   | 278 +++++++++++++++----
 .../spark/scheduler/DAGSchedulerEvent.scala     |   3 +-
 .../apache/spark/scheduler/SparkListener.scala  |   2 +-
 .../scheduler/cluster/ClusterScheduler.scala    |   4 +-
 .../cluster/ClusterTaskSetManager.scala         |   2 +-
 .../spark/scheduler/local/LocalScheduler.scala  |  27 +-
 .../org/apache/spark/JobCancellationSuite.scala |   4 +-
 .../spark/scheduler/DAGSchedulerSuite.scala     |  43 ++-
 9 files changed, 280 insertions(+), 91 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spark/blob/e0392343/core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterScheduler.scala
----------------------------------------------------------------------
