GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2910

    [backport] [FLINK-5193] [jm] Harden job recovery in case of recovery failures

    This is a backport of #2909 to the release 1.1 branch.
    
    When recovering multiple jobs, a single recovery failure prevented all
    jobs from being recovered. This PR makes the recovery of each job
    independent, so that a single failure no longer stalls the complete
    recovery. Furthermore, this PR improves the error reporting for failures
    originating in the ZooKeeperSubmittedJobGraphStore.
    
    cc @uce 
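
For readers who want a picture of the behaviour described above, here is a minimal Java sketch of per-job recovery isolation. It is not the code from this PR: `JobStore`, `RecoveryResult`, `recoverAll` and `demoStore` are hypothetical names invented for illustration, and the job IDs and error text are made up. The assumption is only that each job is recovered inside its own try/catch, and that a failure is recorded with the job ID attached instead of aborting the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndependentRecoverySketch {

    /** Hypothetical stand-in for a store of submitted job graphs. */
    interface JobStore {
        List<String> jobIds();
        String recoverJob(String jobId) throws Exception;
    }

    /** Collects the jobs that were recovered and the per-job failures. */
    static final class RecoveryResult {
        public final List<String> recovered = new ArrayList<>();
        public final Map<String, Exception> failed = new LinkedHashMap<>();
    }

    /** Recovers every job independently; one failure does not stop the others. */
    static RecoveryResult recoverAll(JobStore store) {
        RecoveryResult result = new RecoveryResult();
        for (String jobId : store.jobIds()) {
            try {
                result.recovered.add(store.recoverJob(jobId));
            } catch (Exception e) {
                // Wrap the cause so the report names the job that failed.
                result.failed.put(jobId,
                    new Exception("Could not recover job " + jobId + ".", e));
            }
        }
        return result;
    }

    /** Made-up store in which the second of three jobs cannot be recovered. */
    static JobStore demoStore() {
        return new JobStore() {
            @Override
            public List<String> jobIds() {
                return Arrays.asList("job-a", "job-b", "job-c");
            }

            @Override
            public String recoverJob(String jobId) throws Exception {
                if (jobId.equals("job-b")) {
                    throw new Exception("simulated corrupt state handle");
                }
                return jobId;
            }
        };
    }

    public static void main(String[] args) {
        RecoveryResult result = recoverAll(demoStore());
        System.out.println("recovered: " + result.recovered);
        System.out.println("failed:    " + result.failed.keySet());
    }
}
```

In this sketch the failure of `job-b` is reported with its ID while `job-a` and `job-c` are still recovered, which is the independence the description asks for.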

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink backportFixJobRecoveryFailure

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2910.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2910
    
----
commit 01620e88ca5a963941ced979c143ab95777249d8
Author: Till Rohrmann <[email protected]>
Date:   2016-11-29T16:31:08Z

    [FLINK-5193] [jm] Harden job recovery in case of recovery failures
    
    When recovering multiple jobs, a single recovery failure prevented all
    jobs from being recovered. This PR makes the recovery of each job
    independent, so that a single failure no longer stalls the complete
    recovery. Furthermore, this PR improves the error reporting for failures
    originating in the ZooKeeperSubmittedJobGraphStore.
    
    Add test case
    
    Fix failing JobManagerHACheckpointRecoveryITCase

----


