GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/3548
[SPARK-4498] [WIP] Add driver -> master heartbeat to detect exited
applications and fix executor failure detection logic
This is a WIP fix for SPARK-4498. It isn't the final fix that I want to
merge, but I'm submitting it now to get early feedback from Jenkins and
reviewers. The main idea is to add a periodic driver -> master heartbeat
that both signals driver liveness and reports whether the driver has
received executors. This lets the master implement proper "don't kill
an application due to failed executors as long as it has some running
executors" logic.
See discussion at https://issues.apache.org/jira/browse/SPARK-4498 for
context.
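To make the intended master-side decision concrete for reviewers, here is a rough sketch in Python. It is not the code in this PR: the class names, method names, timeout, and failure threshold are all hypothetical placeholders (the interval and threshold are still open questions), and it only models the rule described above.

```python
from dataclasses import dataclass

# Hypothetical values: the heartbeat interval and failure thresholds are
# still open questions, so these numbers are placeholders only.
HEARTBEAT_TIMEOUT_SECS = 60.0
MAX_EXECUTOR_FAILURES = 10


@dataclass
class AppState:
    """Master-side view of one application (illustrative, not Spark's class)."""
    last_heartbeat: float                # time of the latest driver heartbeat
    has_running_executors: bool = False  # as last reported by the driver
    executor_failures: int = 0           # executor failures seen so far


class FailureDetector:
    """Sketch of the decision logic; all names here are assumptions."""

    def __init__(self):
        self.apps = {}

    def register(self, app_id, now):
        # Registration counts as the first sign of life from the driver.
        self.apps[app_id] = AppState(last_heartbeat=now)

    def on_heartbeat(self, app_id, has_running_executors, now):
        # The heartbeat signals liveness and carries the executor status.
        app = self.apps[app_id]
        app.last_heartbeat = now
        app.has_running_executors = has_running_executors

    def on_executor_failed(self, app_id):
        self.apps[app_id].executor_failures += 1

    def should_remove(self, app_id, now):
        app = self.apps[app_id]
        # A driver that stopped heartbeating is treated as exited.
        if now - app.last_heartbeat > HEARTBEAT_TIMEOUT_SECS:
            return True
        # Failed executors only count against an application that
        # currently has no running executors.
        return (app.executor_failures >= MAX_EXECUTOR_FAILURES
                and not app.has_running_executors)
```

In this sketch, an application with many failed executors is kept as long as its most recent heartbeat reported at least one running executor, and an application whose driver goes silent is removed regardless of executor state.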
Before merging, this needs more comments and tests. Specifically, I need
tests to check that the heartbeat's information actually corresponds to the
right notion of application progress / liveness. There are also open questions
about heartbeat interval configuration and failure thresholds. I'll edit this
description to accurately reflect the PR before I remove the `[WIP]` tag.
/cc @markhamstra @aarondav @andrewor14 @pwendell @airhorns
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark standalone-failure-detector-interface
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3548.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3548
----
commit 87d7960d660b218a9a965fd7d344e2aae0250128
Author: Josh Rosen <[email protected]>
Date: 2014-12-02T01:08:17Z
Factor application failure detector logic into own class; add tests.
commit 08746eb02ed6e3d114c56ed77a225a1841e3d7ea
Author: Josh Rosen <[email protected]>
Date: 2014-12-02T04:52:31Z
[SPARK-4498] [WIP] Add driver -> master heartbeat
commit 418af7ea5e78e2d24104f3cf024f412e1c23bdb6
Author: Josh Rosen <[email protected]>
Date: 2014-12-02T04:55:36Z
Revert debugging comment
----