Till Rohrmann created FLINK-6160:
------------------------------------
Summary: Retry JobManager/ResourceManager connection in case of
timeout
Key: FLINK-6160
URL: https://issues.apache.org/jira/browse/FLINK-6160
Project: Flink
Issue Type: Sub-task
Components: Distributed Coordination
Affects Versions: 1.3.0
Reporter: Till Rohrmann
Fix For: 1.3.0
In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to
the remote component. Furthermore, it assumes that the component has actually
failed, so it only tries to reconnect once it is notified about a new leader
address and leader session ID. This is brittle, because a heartbeat can also
time out without the component having crashed (for example, when heartbeat
messages are merely delayed). Thus, after a timeout we should automatically
retry connecting to the latest known leader address and leader session ID.
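
The proposed behaviour could be sketched roughly as follows. This is only an
illustration and not Flink's actual {{TaskExecutor}} code: the class
{{LeaderReconnector}}, its methods, and the {{connect}} callback are
hypothetical names, and the bounded synchronous retry loop with a fixed delay
is an assumption made to keep the example small.

```java
import java.util.Objects;
import java.util.UUID;
import java.util.function.BiPredicate;

/** Hypothetical sketch: remembers the last known leader and retries it after a heartbeat timeout. */
public class LeaderReconnector {
    // Assumed callback: attempts a connection and returns true on success.
    private final BiPredicate<String, UUID> connect;
    private final int maxAttempts;
    private final long retryDelayMillis;

    private String leaderAddress;   // latest known leader address
    private UUID leaderSessionId;   // latest known leader session id
    private boolean connected;

    public LeaderReconnector(BiPredicate<String, UUID> connect, int maxAttempts, long retryDelayMillis) {
        this.connect = Objects.requireNonNull(connect);
        this.maxAttempts = maxAttempts;
        this.retryDelayMillis = retryDelayMillis;
    }

    /** Called when a new leader is announced: remember it and connect once. */
    public void notifyLeader(String address, UUID sessionId) {
        this.leaderAddress = Objects.requireNonNull(address);
        this.leaderSessionId = Objects.requireNonNull(sessionId);
        this.connected = connect.test(address, sessionId);
    }

    /**
     * Called on heartbeat timeout: drop the connection, then retry the latest
     * known leader instead of waiting for a new leader notification.
     */
    public boolean onHeartbeatTimeout() {
        connected = false;
        if (leaderAddress == null) {
            return false; // no leader known yet, nothing to retry
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connect.test(leaderAddress, leaderSessionId)) {
                connected = true;
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(retryDelayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
        }
        return false; // leader unreachable; wait for a new leader notification
    }

    public boolean isConnected() {
        return connected;
    }
}
```

If all retries fail, the sketch falls back to the current behaviour of waiting
for a new leader address and leader session id. A real implementation would
presumably schedule the retries asynchronously rather than block the caller.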