[ https://issues.apache.org/jira/browse/FLINK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646333#comment-16646333 ]
Robert Metzger commented on FLINK-6625:
---------------------------------------
Can we guarantee that checkpoints are consistent when a job has terminated in
the FAILED state?
Can't a failed checkpoint (one that was unable to commit something, write
somewhere, etc.) be what fails the job? Such an incomplete checkpoint would
make HA recovery impossible.
> Flink removes HA job data when reaching JobStatus.FAILED
> --------------------------------------------------------
>
> Key: FLINK-6625
> URL: https://issues.apache.org/jira/browse/FLINK-6625
> Project: Flink
> Issue Type: Improvement
> Components: Distributed Coordination
> Affects Versions: 1.3.0, 1.4.0
> Reporter: Till Rohrmann
> Priority: Major
>
> Currently, Flink removes all job-related data (submitted {{JobGraph}} as well
> as checkpoints) when it reaches a globally terminal state (including
> {{JobStatus.FAILED}}). In high availability mode, this entails that all data
> is removed from ZooKeeper and there is no way to recover the job by
> restarting the cluster with the same cluster id.
> I think this is problematic, since an application might just have failed
> because it has depleted its number of restart attempts. Also the last
> checkpoint information could be helpful when trying to find out why the job
> has actually failed. I propose that we only remove job data when reaching the
> state {{JobStatus.FINISHED}} or {{JobStatus.CANCELED}}.
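
The cleanup rule proposed in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Flink's actual code: the class `HaCleanupPolicy`, the method `shouldRemoveJobData`, and the nested enum are hypothetical stand-ins, with `FINISHED` denoting successful completion.

```java
import java.util.EnumSet;
import java.util.Set;

public class HaCleanupPolicy {

    /** Hypothetical local stand-in for the job states discussed here; not Flink's own enum. */
    enum JobStatus { RUNNING, FAILED, CANCELED, FINISHED }

    // Under the proposal, HA data is removed only after successful completion
    // or explicit cancellation; FAILED keeps the JobGraph and checkpoints.
    private static final Set<JobStatus> REMOVE_HA_DATA_ON =
            EnumSet.of(JobStatus.FINISHED, JobStatus.CANCELED);

    /** Returns true if job data should be removed once the job reaches the given state. */
    static boolean shouldRemoveJobData(JobStatus status) {
        return REMOVE_HA_DATA_ON.contains(status);
    }

    public static void main(String[] args) {
        // Print the decision for every state in the stand-in enum.
        for (JobStatus status : JobStatus.values()) {
            System.out.println(status + " -> remove HA data: " + shouldRemoveJobData(status));
        }
    }
}
```

With a rule like this, a job that ends in FAILED would leave its data in place, so the question of whether the retained checkpoints are consistent still has to be answered separately.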