scheduler_hang

matei Thu, 14 Nov 2013 22:30:52 -0800

Merge pull request #173 from kayousterhout/scheduler_hang

Fix bug where scheduler could hang after task failure.


When a task fails, we need to call reviveOffers() so that the
task can be rescheduled on a different machine. In the current code,
the state in ClusterTaskSetManager indicating which tasks are
pending may be updated after reviveOffers() is called (there's a
race condition here), so when reviveOffers() runs, the task set
manager does not yet know that there are failed tasks that need
to be relaunched.
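
The ordering problem described above can be illustrated with a minimal,
single-threaded sketch. This is not the actual Spark code: the class names
SketchTaskSetManager and SketchScheduler and the methods
onTaskFailedReviveFirst and onTaskFailedUpdateFirst are hypothetical, and the
real bug is a race between threads rather than a fixed ordering. The sketch
only shows why reviving offers before the pending-task state is updated
leaves the failed task unscheduled.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for a task set manager: it only tracks
// which task ids are waiting to be (re)launched.
class SketchTaskSetManager {
  private val pending = mutable.Queue[Int]()

  // Record that a task failed, so it becomes eligible for relaunch.
  def handleFailedTask(taskId: Int): Unit = pending.enqueue(taskId)

  // Answer a resource offer with the next pending task, if any.
  def resourceOffer(): Option[Int] =
    if (pending.nonEmpty) Some(pending.dequeue()) else None
}

// Hypothetical scheduler that records every task it launches.
class SketchScheduler(manager: SketchTaskSetManager) {
  val launched = mutable.ArrayBuffer[Int]()

  // Offer free resources to the manager and launch whatever it returns.
  def reviveOffers(): Unit = manager.resourceOffer().foreach(launched += _)

  // Problematic ordering: offers are revived before the failure is recorded,
  // so the manager has nothing to relaunch and the job can hang.
  def onTaskFailedReviveFirst(taskId: Int): Unit = {
    reviveOffers()
    manager.handleFailedTask(taskId)
  }

  // Safe ordering: record the failure first, then revive offers.
  def onTaskFailedUpdateFirst(taskId: Int): Unit = {
    manager.handleFailedTask(taskId)
    reviveOffers()
  }
}
```

With the revive-first ordering, the failed task stays in the pending queue
and nothing is launched until some later event triggers another offer. With
the update-first ordering, the same failure leads to an immediate relaunch.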

This isn't currently unit tested but will be once my pull request for
merging the cluster and local schedulers goes in -- at which point
many more of the unit tests will exercise the code paths through
the cluster scheduler (currently the failure test suite uses the local
scheduler, which is why we didn't see this bug before).


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/96e0fb46
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/96e0fb46
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/96e0fb46

Branch: refs/heads/master
Commit: 96e0fb46309698b685c811a65bd8e1a691389994
Parents: dfd40e9 b4546ba
Author: Matei Zaharia <[email protected]>
Authored: Thu Nov 14 22:29:28 2013 -0800
Committer: Matei Zaharia <[email protected]>
Committed: Thu Nov 14 22:29:28 2013 -0800

----------------------------------------------------------------------
 .../spark/scheduler/cluster/ClusterScheduler.scala     | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)
----------------------------------------------------------------------

[2/2] git commit: Merge pull request #173 from kayousterhout/scheduler_hang
