[ https://issues.apache.org/jira/browse/FLINK-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ufuk Celebi updated FLINK-3411:
-------------------------------
Description:
When a job is recovered by a standby job manager and the recovery of the
checkpoint state or job fails, the job might eventually be removed by the job
manager after all retries are exhausted. This leads to the removal of the
job/checkpoint state in ZooKeeper and the state backend, making it impossible
to ever recover the job again.
We should never exhaust job retries in the HA case.
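
The proposed behaviour can be illustrated with a minimal sketch. It is not
Flink code: the class RecoveryPolicy and its fields and methods are
hypothetical names, used only to show that state cleanup is tied to retry
exhaustion, and that retries are never exhausted in the HA case.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPolicy:
    """Hypothetical policy object for illustration; not part of Flink's API."""

    high_availability: bool
    max_attempts: int  # only honoured when high_availability is False

    def may_retry(self, failed_attempts: int) -> bool:
        # In HA mode the retry budget is never exhausted.
        if self.high_availability:
            return True
        return failed_attempts < self.max_attempts

    def may_discard_state(self, failed_attempts: int) -> bool:
        # Job/checkpoint state may only be cleaned up once retries are
        # exhausted, which by construction never happens in HA mode.
        return not self.may_retry(failed_attempts)
```

With this policy, a failed recovery in HA mode only ever leads to another
attempt, so the state in ZooKeeper and the state backend is left in place.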
was: When a job is recovered by a standby job manager and the recovery of the
checkpoint state or job fails, the job will be removed by the job manager. This
leads to the removal of the job/checkpoint state in ZooKeeper and the state
backend, making it impossible to ever recover the job again.
> Failed recovery can lead to removal of HA state
> -----------------------------------------------
>
> Key: FLINK-3411
> URL: https://issues.apache.org/jira/browse/FLINK-3411
> Project: Flink
> Issue Type: Bug
> Components: Distributed Runtime
> Reporter: Ufuk Celebi
> Priority: Critical
>
> When a job is recovered by a standby job manager and the recovery of the
> checkpoint state or job fails, the job might eventually be removed by the job
> manager after all retries are exhausted. This leads to the removal of the
> job/checkpoint state in ZooKeeper and the state backend, making it impossible
> to ever recover the job again.
> We should never exhaust job retries in the HA case.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)