[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-6160:
----------------------------
    Description: 
In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
the remote component. Furthermore, it assumes that the component has actually 
failed and therefore only tries to reconnect once it is notified about a new 
leader address and leader session id. This is brittle, because the heartbeat 
can also time out without the component having crashed. Thus, on a timeout we 
should automatically retry the connection using the latest known leader 
address information.

*Acceptance criteria:*
  - The registration should be retried until a time limit expires, after 
which the {{TaskExecutor}} terminates.
  - If the registration is declined ({{RegistrationResponse.Decline}}), the 
{{TaskExecutor}} should terminate.

  was:In case of a heartbeat timeout, the {{TaskExecutor}} closes the 
connection to the remote component. Furthermore, it assumes that the component 
has actually failed and, thus, it will only start trying to connect to the 
component if it is notified about a new leader address and leader session id. 
This is brittle, because the heartbeat could also time out without the 
component having crashed. Thus, we should add an automatic retry to the latest 
known leader address information in case of a timeout.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---------------------------------------------------------------
>
>                 Key: FLINK-6160
>                 URL: https://issues.apache.org/jira/browse/FLINK-6160
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>    Affects Versions: 1.3.0, 1.5.0, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and therefore only tries to reconnect once it is notified about a new 
> leader address and leader session id. This is brittle, because the heartbeat 
> can also time out without the component having crashed. Thus, on a timeout we 
> should automatically retry the connection using the latest known leader 
> address information.
>
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires, after 
> which the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.
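
The two acceptance criteria above could be sketched roughly as follows. This 
is only an illustration under stated assumptions, not the actual Flink 
implementation: the class RegistrationRetrySketch, the enums Outcome and 
Result, and the method retryRegistration are hypothetical names, and a plain 
Supplier stands in for one registration attempt against the latest known 
leader address.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from the Flink code base. */
final class RegistrationRetrySketch {

    /** Assumed outcome of a single registration attempt. */
    enum Outcome { SUCCESS, DECLINED, TIMEOUT }

    /** Assumed final result: either registered, or the reason to terminate. */
    enum Result { REGISTERED, TERMINATE_DECLINED, TERMINATE_TIME_LIMIT }

    /**
     * Retries the attempt until it succeeds, is declined, or the time limit
     * expires. At least one attempt is always made.
     */
    static Result retryRegistration(
            Supplier<Outcome> attempt,
            Duration timeLimit,
            Duration pauseBetweenAttempts) {
        final long deadline = System.nanoTime() + timeLimit.toNanos();
        while (true) {
            Outcome outcome = attempt.get();
            if (outcome == Outcome.SUCCESS) {
                return Result.REGISTERED;
            }
            if (outcome == Outcome.DECLINED) {
                // Second criterion: a declined registration is not retried.
                return Result.TERMINATE_DECLINED;
            }
            // The attempt timed out. First criterion: retry only while the
            // overall time limit has not yet expired.
            if (System.nanoTime() - deadline >= 0) {
                return Result.TERMINATE_TIME_LIMIT;
            }
            try {
                Thread.sleep(pauseBetweenAttempts.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while retrying", e);
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated attempts: two timeouts, then a successful registration.
        Result result = retryRegistration(
                () -> ++calls[0] < 3 ? Outcome.TIMEOUT : Outcome.SUCCESS,
                Duration.ofSeconds(5),
                Duration.ofMillis(10));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In this sketch the caller decides how to terminate based on the returned 
Result. The fixed pause between attempts is a simplification chosen for 
brevity; a backoff strategy would fit the same loop.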



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
