GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/3548

    [SPARK-4498] [WIP] Add driver -> master heartbeat to detect exited 
applications and fix executor failure detection logic

    This is a WIP fix for SPARK-4498; this isn't the final fix that I want to 
merge in, but I'm submitting this now to get early feedback from Jenkins and 
reviewers.  The main idea here is to add a periodic driver -> master heartbeat 
that both signals driver liveness and carries information on whether the 
driver has received executors, which allows us to implement proper "don't kill 
an application due to failed executors as long as it has some running 
executors" logic in the master.
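
The master-side rule described above can be sketched roughly as follows. This is an illustrative Python sketch only, not the code in this PR (Spark itself is written in Scala). The class `AppFailureDetector`, its method names, and the timeout and threshold values are all hypothetical, since the heartbeat interval and failure thresholds are still open questions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppFailureDetector:
    """Hypothetical master-side state for one application (not Spark's API)."""

    heartbeat_timeout_s: float = 60.0  # assumed value; left open in the PR
    max_executor_failures: int = 10    # assumed value; left open in the PR
    last_heartbeat_s: Optional[float] = None
    has_running_executors: bool = False
    executor_failures: int = 0

    def on_heartbeat(self, now_s: float, has_running_executors: bool) -> None:
        # Each heartbeat shows the driver is alive and reports whether it
        # currently has executors.
        self.last_heartbeat_s = now_s
        self.has_running_executors = has_running_executors

    def on_executor_failed(self) -> None:
        # Count executor failures observed by the master.
        self.executor_failures += 1

    def driver_exited(self, now_s: float) -> bool:
        # Before the first heartbeat there is nothing to judge; registration
        # time is not modeled in this sketch.
        if self.last_heartbeat_s is None:
            return False
        return now_s - self.last_heartbeat_s > self.heartbeat_timeout_s

    def should_kill_for_executor_failures(self) -> bool:
        # Failed executors alone are not enough to kill the application
        # while the driver still reports running executors.
        return (
            self.executor_failures >= self.max_executor_failures
            and not self.has_running_executors
        )
```

With this shape, an application whose executors keep failing is only removed once the driver's latest heartbeat reports no running executors, and a driver that stops heartbeating is treated as exited after the timeout.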
    
    See discussion at https://issues.apache.org/jira/browse/SPARK-4498 for 
context.
    
    Before merging, this needs more comments and tests.  Specifically, I need 
tests to check that the heartbeat's information actually corresponds to the 
right notion of application progress / liveness.  There are also open questions 
about heartbeat interval configuration and failure thresholds.  I'll edit this 
description to accurately reflect the PR before I remove the `[WIP]` tag.
    
    /cc @markhamstra @aarondav @andrewor14 @pwendell @airhorns

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark standalone-failure-detector-interface

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3548.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3548
    
----
commit 87d7960d660b218a9a965fd7d344e2aae0250128
Author: Josh Rosen <[email protected]>
Date:   2014-12-02T01:08:17Z

    Factor application failure detector logic into own class; add tests.

commit 08746eb02ed6e3d114c56ed77a225a1841e3d7ea
Author: Josh Rosen <[email protected]>
Date:   2014-12-02T04:52:31Z

    [SPARK-4498] [WIP] Add driver -> master heartbeat

commit 418af7ea5e78e2d24104f3cf024f412e1c23bdb6
Author: Josh Rosen <[email protected]>
Date:   2014-12-02T04:55:36Z

    Revert debugging comment

----


