[ https://issues.apache.org/jira/browse/FLINK-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ufuk Celebi updated FLINK-3411:
-------------------------------
    Description: 
When a job is recovered by a standby job manager and recovery of the 
checkpoint state or the job itself fails, the job manager might eventually 
remove the job once all retries are exhausted. This also removes the 
job/checkpoint state from ZooKeeper and the state backend, making it 
impossible to ever recover the job again.

We should never exhaust job retries in the HA case.
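
To illustrate the intended behaviour, here is a minimal sketch of a retry 
policy that never gives up in HA mode. The class HaRetryPolicy and its 
methods are hypothetical names invented for this example and are not Flink's 
actual API. The point it shows is that persisted state may only be discarded 
once retries are exhausted, which should never happen when HA is enabled.

```java
/** Hypothetical sketch only; not taken from the Flink code base. */
public class HaRetryPolicy {
    private final boolean highAvailability;
    private final int maxAttempts;

    public HaRetryPolicy(boolean highAvailability, int maxAttempts) {
        this.highAvailability = highAvailability;
        this.maxAttempts = maxAttempts;
    }

    /** True if another recovery attempt should be made after the given number of failures. */
    public boolean shouldRetry(int failedAttempts) {
        // In HA mode the retry budget is effectively unlimited.
        return highAvailability || failedAttempts < maxAttempts;
    }

    /** State may be discarded only once retries are exhausted, which never happens in HA mode. */
    public boolean mayDiscardState(int failedAttempts) {
        return !shouldRetry(failedAttempts);
    }
}
```

With this policy, a non-HA job configured with three attempts may have its 
state discarded after the third failure, while an HA job keeps its 
ZooKeeper and state backend entries regardless of how many recoveries fail.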

  was:When a job is recovered by a standby job manager and the recovery of the 
checkpoint state or job fails, the job will be removed by the job manager. This 
leads to the removal of the job/checkpoint state in ZooKeeper and the state 
backend, making it impossible to ever recover the job again.


> Failed recovery can lead to removal of HA state
> -----------------------------------------------
>
>                 Key: FLINK-3411
>                 URL: https://issues.apache.org/jira/browse/FLINK-3411
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>            Reporter: Ufuk Celebi
>            Priority: Critical
>
> When a job is recovered by a standby job manager and recovery of the 
> checkpoint state or the job itself fails, the job manager might eventually 
> remove the job once all retries are exhausted. This also removes the 
> job/checkpoint state from ZooKeeper and the state backend, making it 
> impossible to ever recover the job again.
> We should never exhaust job retries in the HA case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
