GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/6035

    [FLINK-6160] Add reconnection attempts in case of heartbeat timeouts to JobMaster and TaskExecutor

    ## What is the purpose of the change

    If a heartbeat timeout with the ResourceManager (RM) occurs on the JobMaster or the TaskExecutor, each of them tries to reconnect to the last known RM address.

    Additionally, we now respect `TaskManagerOptions#REGISTRATION_TIMEOUT` on the TaskExecutor. If the TaskExecutor cannot register at an RM within the given registration timeout, it fails with a fatal exception. This lets the TaskExecutor process fail when it cannot establish a connection, which ultimately frees the resources it occupies.

    The commit also changes the default value of `TaskManagerOptions#REGISTRATION_TIMEOUT` from "Inf" to "5 min".

    cc @GJL.

    ## Brief change log

    - Retry connection to RM in case of heartbeat timeout on `JobMaster` and `TaskExecutor`
    - Fail `TaskExecutor` if it cannot connect to the RM within `TaskManagerOptions#REGISTRATION_TIMEOUT`

    ## Verifying this change

    - Adapted `JobMasterTest#testHeartbeatTimeoutWithResourceManager`
    - Adapted `TaskExecutorTest#testHeartbeatTimeoutWithResourceManager`
    - Added `TaskExecutorTest#testMaximumRegistrationDuration` and `TaskExecutorTest#testMaximumRegistrationDurationAfterConnectionLoss`

    ## Does this pull request potentially affect one of the following parts:

    - Dependencies (does it add or upgrade a dependency): (no)
    - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
    - The serializers: (no)
    - The runtime per-record code paths (performance sensitive): (no)
    - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
    - The S3 file system connector: (no)

    ## Documentation

    - Does this pull request introduce a new feature? (no)
    - If yes, how is the feature documented? (not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixReconnection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6035.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6035

----
commit 6b45c84cf06688099e71c9e1809917653af43d31
Author: Till Rohrmann <trohrmann@...>
Date:   2018-05-17T12:44:14Z

    [FLINK-6160] Add reconnection attempts in case of heartbeat timeouts to JobMaster and TaskExecutor

    If a heartbeat timeout with the RM occurs on the JobMaster or the TaskExecutor,
    each of them tries to reconnect to the last known RM address.

    Additionally, we now respect TaskManagerOptions#REGISTRATION_TIMEOUT on the
    TaskExecutor. If the TaskExecutor cannot register at an RM within the given
    registration timeout, it fails with a fatal exception. This lets the
    TaskExecutor process fail when it cannot establish a connection, which
    ultimately frees the resources it occupies.

    The commit also changes the default value of
    TaskManagerOptions#REGISTRATION_TIMEOUT from "Inf" to "5 min".

----


---
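
The behaviour described in this pull request (reconnect after a heartbeat timeout, and fail once a registration attempt exceeds the configured timeout) can be illustrated with a small state machine. This is only a sketch under stated assumptions: the class `RegistrationDeadlineSketch`, its methods, and the explicit `nowMillis` clock parameter are hypothetical names chosen for illustration and are not part of Flink's `TaskExecutor`. The actual change may well use scheduled callbacks rather than polling.

```java
import java.time.Duration;

/**
 * Hypothetical illustration of a registration deadline; NOT Flink code.
 * Time is passed in explicitly so the behaviour is deterministic.
 */
final class RegistrationDeadlineSketch {

    enum State { REGISTERING, REGISTERED, FAILED }

    private final long maxRegistrationMillis;
    private State state;
    private long registrationStartMillis;

    RegistrationDeadlineSketch(Duration maxRegistrationDuration, long nowMillis) {
        this.maxRegistrationMillis = maxRegistrationDuration.toMillis();
        startRegistration(nowMillis);
    }

    /** Begins a (re)registration attempt and restarts the deadline. */
    private void startRegistration(long nowMillis) {
        state = State.REGISTERING;
        registrationStartMillis = nowMillis;
    }

    /** The resource manager acknowledged the registration. */
    void onRegistrationSuccess() {
        if (state == State.REGISTERING) {
            state = State.REGISTERED;
        }
    }

    /** A heartbeat timeout triggers a reconnection attempt instead of giving up. */
    void onHeartbeatTimeout(long nowMillis) {
        if (state == State.REGISTERED) {
            startRegistration(nowMillis);
        }
    }

    /** Marks the attempt as failed once the deadline has passed; returns the state. */
    State check(long nowMillis) {
        if (state == State.REGISTERING
                && nowMillis - registrationStartMillis >= maxRegistrationMillis) {
            state = State.FAILED; // a real process would raise a fatal error here
        }
        return state;
    }

    public static void main(String[] args) {
        RegistrationDeadlineSketch executor =
                new RegistrationDeadlineSketch(Duration.ofMinutes(5), 0L);
        // One minute in: still inside the five-minute deadline.
        System.out.println(executor.check(60_000L));
        executor.onRegistrationSuccess();
        // Connection lost after ten minutes; the deadline restarts from here.
        executor.onHeartbeatTimeout(600_000L);
        // Five more minutes without a successful registration.
        System.out.println(executor.check(600_000L + 300_000L));
    }
}
```

Restarting the deadline on every connection loss mirrors the two scenarios named by the added tests (`testMaximumRegistrationDuration` and `testMaximumRegistrationDurationAfterConnectionLoss`), although how those tests are implemented is not shown in this message.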