GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2909

    [FLINK-5193] [jm] Harden job recovery in case of recovery failures

    When recovering multiple jobs, a single recovery failure prevented all of
    them from being recovered. This PR changes this behaviour to make the
    recovery of each job independent, so that a single failure won't make the
    complete recovery fail. Furthermore, this PR improves the error reporting
    for failures originating in the ZooKeeperSubmittedJobGraphStore.
    
    Add test case
    
    Fix failing JobManagerHACheckpointRecoveryITCase
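    
    A minimal sketch of the idea of independent recovery, for illustration
    only. It is not taken from the Flink code base: the names
    IndependentRecovery, JobStore, Result and recoverAll are hypothetical, and
    jobs are represented as plain strings. Each job is recovered inside its own
    try/catch, so one failure is recorded and the loop moves on to the next job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these types exist in Flink.
final class IndependentRecovery {

    /** Stand-in for a store that may fail while loading a single job. */
    public interface JobStore {
        String recover(String jobId) throws Exception;
    }

    /** Collects the jobs that were recovered and the failures per job id. */
    public static final class Result {
        public final List<String> recovered = new ArrayList<>();
        public final Map<String, Exception> failed = new LinkedHashMap<>();
    }

    /** Recovers every job on its own; a failure only affects that job. */
    public static Result recoverAll(Collection<String> jobIds, JobStore store) {
        Result result = new Result();
        for (String jobId : jobIds) {
            try {
                result.recovered.add(store.recover(jobId));
            } catch (Exception e) {
                // Record the failure with its job id and continue with the rest.
                result.failed.put(jobId, e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed test store: recovery of job "b" always fails.
        JobStore store = id -> {
            if (id.equals("b")) {
                throw new Exception("could not recover job " + id);
            }
            return "graph-" + id;
        };
        Result r = recoverAll(Arrays.asList("a", "b", "c"), store);
        System.out.println("recovered=" + r.recovered + " failed=" + r.failed.keySet());
    }
}
```

    In this sketch the failure of job "b" does not prevent "a" and "c" from
    being recovered, and the exception is kept together with the job id so it
    can be reported afterwards.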

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixJobRecoveryFailure

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2909.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2909
    
----
commit d61636d0465e0e0f274871a883d8d376c223a1f3
Author: Till Rohrmann <[email protected]>
Date:   2016-11-29T16:31:08Z

    [FLINK-5193] [jm] Harden job recovery in case of recovery failures
    
    When recovering multiple jobs, a single recovery failure prevented all of
    them from being recovered. This PR changes this behaviour to make the
    recovery of each job independent, so that a single failure won't stall the
    complete recovery. Furthermore, this PR improves the error reporting
    for failures originating in the ZooKeeperSubmittedJobGraphStore.
    
    Add test case
    
    Fix failing JobManagerHACheckpointRecoveryITCase

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
