GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/2909
[FLINK-5193] [jm] Harden job recovery in case of recovery failures
When recovering multiple jobs, a single recovery failure prevented all jobs
from being recovered.
This PR changes this behaviour and makes the recovery of each job independent,
so that a single failure no longer makes the complete recovery fail.
Furthermore, this PR improves the error reporting for failures originating in
the ZooKeeperSubmittedJobGraphStore.
Add test case
Fix failing JobManagerHACheckpointRecoveryITCase
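For illustration only, the idea of independent recovery can be sketched roughly as follows. This is not the code from this PR: `JobStore`, `Result` and `recoverAll` are hypothetical names chosen for the example, and the real change lives in the JobManager and the ZooKeeperSubmittedJobGraphStore. The sketch assumes that each job can be recovered by its id and that a failure for one id should be recorded together with that id instead of aborting the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndependentRecoverySketch {

    /** Hypothetical stand-in for a submitted job graph store; not Flink's actual interface. */
    interface JobStore {
        String recover(String jobId) throws Exception;
    }

    /** Outcome of one recovery pass: what was recovered and what failed, keyed by job id. */
    static final class Result {
        final List<String> recovered = new ArrayList<>();
        final Map<String, Exception> failures = new LinkedHashMap<>();
    }

    /**
     * Recovers every job on its own. An exception for one job is caught and
     * recorded, so the remaining jobs are still attempted.
     */
    static Result recoverAll(List<String> jobIds, JobStore store) {
        Result result = new Result();
        for (String jobId : jobIds) {
            try {
                result.recovered.add(store.recover(jobId));
            } catch (Exception e) {
                // Wrap the cause with the job id so the error report says which job failed.
                result.failures.put(
                    jobId, new Exception("Could not recover job " + jobId + ".", e));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed scenario: the stored state of job-2 cannot be read.
        JobStore store = jobId -> {
            if (jobId.equals("job-2")) {
                throw new Exception("corrupt state handle");
            }
            return "graph-of-" + jobId;
        };

        Result result = recoverAll(Arrays.asList("job-1", "job-2", "job-3"), store);
        System.out.println("recovered: " + result.recovered);
        System.out.println("failed:    " + result.failures.keySet());
    }
}
```

Without the try/catch inside the loop, the exception thrown for job-2 would propagate out of `recoverAll` and job-3 would never be attempted, which is the behaviour described above as the problem.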
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink fixJobRecoveryFailure
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2909.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2909
----
commit d61636d0465e0e0f274871a883d8d376c223a1f3
Author: Till Rohrmann <[email protected]>
Date: 2016-11-29T16:31:08Z
[FLINK-5193] [jm] Harden job recovery in case of recovery failures
When recovering multiple jobs, a single recovery failure prevented all jobs
from being recovered.
This PR changes this behaviour and makes the recovery of each job independent,
so that a single failure won't stall the complete recovery. Furthermore, this
PR improves the error reporting for failures originating in the
ZooKeeperSubmittedJobGraphStore.
Add test case
Fix failing JobManagerHACheckpointRecoveryITCase
----