[ https://issues.apache.org/jira/browse/FLINK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646645#comment-16646645 ]

Ufuk Celebi commented on FLINK-6625:
------------------------------------

{quote}Can't a failed checkpoint (unable to commit something, write somewhere, 
etc.) fail the job? Such an incomplete checkpoint would make HA-recovery 
impossible.{quote}

I don't think so. The scenario you describe would result in a {{FAILED}} checkpoint, 
and the previous successful checkpoint would remain the latest one in the HA store. 
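
To make the distinction concrete, here is a minimal sketch (hypothetical class and 
method names, not Flink's actual {{CompletedCheckpointStore}} implementation): only 
checkpoints that complete successfully ever enter the HA store, so a failed attempt 
cannot displace the latest recoverable checkpoint.

{code:java}
// Sketch only: a store that records nothing but successfully completed
// checkpoints, so a FAILED checkpoint attempt never replaces the latest
// recoverable one.
import java.util.ArrayDeque;
import java.util.Deque;

class CompletedCheckpointStoreSketch {

    // Hypothetical value object standing in for a completed checkpoint.
    static final class Checkpoint {
        final long id;
        Checkpoint(long id) { this.id = id; }
    }

    private final Deque<Checkpoint> completed = new ArrayDeque<>();

    // Called only after a checkpoint has been fully acknowledged and persisted.
    void addCompleted(Checkpoint cp) {
        completed.addLast(cp);
    }

    // A declined or failed checkpoint attempt never reaches the store,
    // so the latest entry is always the last successful checkpoint.
    Checkpoint latestForRecovery() {
        return completed.peekLast();
    }
}
{code}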

> Flink removes HA job data when reaching JobStatus.FAILED
> --------------------------------------------------------
>
>                 Key: FLINK-6625
>                 URL: https://issues.apache.org/jira/browse/FLINK-6625
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.3.0, 1.4.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> Currently, Flink removes all job-related data (the submitted {{JobGraph}} as well 
> as checkpoints) when the job reaches a globally terminal state (including 
> {{JobStatus.FAILED}}). In high availability mode, this entails that all data 
> is removed from ZooKeeper and there is no way to recover the job by 
> restarting the cluster with the same cluster id.
>
> I think this is problematic, since an application might have failed simply 
> because it has depleted its number of restart attempts. Also, the last 
> checkpoint information could be helpful when trying to find out why the job 
> actually failed. I propose that we only remove job data when reaching the 
> state {{JobStatus.SUCCESS}} or {{JobStatus.CANCELED}}.
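
A minimal sketch of the proposed guard (hypothetical names; the real cleanup path 
goes through Flink's HA services, not a helper like this): HA job data is removed 
only for the terminal states named above, while a {{FAILED}} job keeps its 
{{JobGraph}} and checkpoints recoverable.

{code:java}
// Sketch only: illustrates the proposed guard, not Flink's actual cleanup code.
class HaCleanupSketch {

    // Hypothetical enum mirroring the states discussed in this issue.
    enum JobStatus { RUNNING, SUCCESS, CANCELED, FAILED }

    // Hypothetical stand-in for the HA store holding JobGraphs and checkpoints.
    interface HaJobStore {
        void removeJobData(String jobId);
    }

    private final HaJobStore haStore;

    HaCleanupSketch(HaJobStore haStore) {
        this.haStore = haStore;
    }

    // Proposed behavior: only a successfully finished or explicitly canceled
    // job has its HA data removed; a FAILED job stays recoverable under the
    // same cluster id.
    void onGloballyTerminalState(String jobId, JobStatus status) {
        if (status == JobStatus.SUCCESS || status == JobStatus.CANCELED) {
            haStore.removeJobData(jobId);
        }
        // FAILED (e.g. restart attempts exhausted): keep the JobGraph and
        // checkpoints in ZooKeeper for recovery and post-mortem analysis.
    }
}
{code}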



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
