GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/5746
[FLINK-8943] [ha] Fail Dispatcher if jobs cannot be recovered from HA store

## What is the purpose of the change

In HA mode, the Dispatcher should fail if it cannot recover the persisted jobs. The idea is that another Dispatcher will then be brought up and will retry the recovery. This is better than simply dropping the jobs that could not be recovered.

cc @GJL

## Brief change log

- Fail the `Dispatcher`/`JobManager` if a persisted job cannot be recovered

## Verifying this change

- Added `DispatcherTest#testFatalErrorAfterJobIdRecoveryFailure` and `DispatcherTest#testFatalErrorAfterJobRecoveryFailure`

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
- The S3 file system connector: (no)

## Documentation

- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented?
(not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink failIfJobNotRecoverable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5746.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5746

----
commit d15e5a2897e5b17ee256cac1374bbcee24104fe2
Author: Till Rohrmann <trohrmann@...>
Date:   2018-03-22T09:46:04Z

    [hotfix] Extend TestingFatalErrorHandler to return an error future

commit 50004f3cfcba112d0e7f05b9875931d25d102110
Author: Till Rohrmann <trohrmann@...>
Date:   2018-03-22T09:46:28Z

    [hotfix] Add BiFunctionWithException

commit f6a6d2da064ff80125600a3a90e773684dc24715
Author: Till Rohrmann <trohrmann@...>
Date:   2018-03-21T21:36:33Z

    [FLINK-8943] [ha] Fail Dispatcher if jobs cannot be recovered from HA store

    In HA mode, the Dispatcher should fail if it cannot recover the persisted jobs. The idea
    is that another Dispatcher will be brought up and tries it again. This is better than
    simply dropping the not recovered jobs.

----
---
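
For illustration only, the behaviour described in this pull request could be sketched roughly as follows. This is a simplified, self-contained sketch and not the code in the PR: `RecoverySketch`, `JobStore`, `FatalErrorHandler`, and `recoverJobs` are hypothetical stand-ins, jobs are represented as plain strings, and the handler exposing its error as a future is only assumed to resemble what the first hotfix commit does for `TestingFatalErrorHandler`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; none of these types are Flink's actual classes.
final class RecoverySketch {

    /** Stand-in for an HA job store. Both lookups may fail. */
    interface JobStore {
        Collection<String> getJobIds() throws Exception;

        String recoverJob(String jobId) throws Exception;
    }

    /** Stand-in fatal error handler that exposes the reported error as a future. */
    static final class FatalErrorHandler {
        private final CompletableFuture<Throwable> errorFuture = new CompletableFuture<>();

        void onFatalError(Throwable error) {
            // Only the first reported error completes the future.
            errorFuture.complete(error);
        }

        CompletableFuture<Throwable> getErrorFuture() {
            return errorFuture;
        }
    }

    /**
     * Recovers all persisted jobs. If listing the job ids or recovering any
     * single job fails, the failure is escalated to the fatal error handler
     * instead of silently skipping the affected job.
     */
    static List<String> recoverJobs(JobStore store, FatalErrorHandler handler) {
        List<String> recovered = new ArrayList<>();
        try {
            for (String jobId : store.getJobIds()) {
                recovered.add(store.recoverJob(jobId));
            }
            return recovered;
        } catch (Exception e) {
            handler.onFatalError(
                new IllegalStateException("Could not recover persisted jobs", e));
            // Nothing is started from a partially recovered set.
            return Collections.emptyList();
        }
    }
}
```

In such a sketch, a test can wait on `getErrorFuture()` to assert that a recovery failure was escalated, which is roughly the kind of check the two new `DispatcherTest` cases are described as performing.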