GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/2910
[backport] [FLINK-5193] [jm] Harden job recovery in case of recovery failures
This is a backport of #2909 to the release 1.1 branch.
When recovering multiple jobs, a single recovery failure prevented all jobs
from being recovered.
This PR makes the recovery of each job independent, so that a single failure
no longer stalls the complete recovery. Furthermore, this PR improves the
error reporting for failures originating in the
ZooKeeperSubmittedJobGraphStore.
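The idea of independent recovery can be illustrated with a minimal sketch. It is not the code from this PR: the class `IndependentRecoverySketch`, the `JobGraphLoader` interface, and the `RecoveryResult` type are hypothetical names chosen for illustration, and job graphs are represented as plain strings. The sketch assumes only that loading one job's state may throw. Each job is loaded in its own try/catch, so one failure is recorded with its job ID and the loop continues with the remaining jobs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are taken from the Flink code base.
class IndependentRecoverySketch {

    /** Stand-in for a store lookup that may fail for an individual job. */
    interface JobGraphLoader {
        String load(String jobId) throws Exception;
    }

    /** Outcome of one recovery pass: recovered job graphs plus per-job failures. */
    static final class RecoveryResult {
        final List<String> recovered = new ArrayList<>();
        final Map<String, Exception> failures = new LinkedHashMap<>();
    }

    /** Attempts to recover every job; a failure affects only the job that caused it. */
    static RecoveryResult recoverAll(List<String> jobIds, JobGraphLoader loader) {
        RecoveryResult result = new RecoveryResult();
        for (String jobId : jobIds) {
            try {
                result.recovered.add(loader.load(jobId));
            } catch (Exception e) {
                // Wrap the cause so the report names the affected job, then move on
                // to the next job instead of aborting the whole recovery.
                result.failures.put(
                    jobId, new Exception("Could not recover job " + jobId + ".", e));
            }
        }
        return result;
    }
}
```

Calling `recoverAll` with three job IDs and a loader that throws for the second one would return the other two jobs as recovered and a single failure entry keyed by the failing job's ID.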
cc @uce
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink backportFixJobRecoveryFailure
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2910.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2910
----
commit 01620e88ca5a963941ced979c143ab95777249d8
Author: Till Rohrmann <[email protected]>
Date: 2016-11-29T16:31:08Z
[FLINK-5193] [jm] Harden job recovery in case of recovery failures
When recovering multiple jobs, a single recovery failure prevented all jobs
from being recovered.
This PR makes the recovery of each job independent, so that a single failure
no longer stalls the complete recovery. Furthermore, this PR improves the
error reporting for failures originating in the
ZooKeeperSubmittedJobGraphStore.
Add test case
Fix failing JobManagerHACheckpointRecoveryITCase
----