[
https://issues.apache.org/jira/browse/FLINK-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann updated FLINK-10751:
----------------------------------
Affects Version/s: 1.7.0
> Checkpoints should be retained when job reaches suspended state
> ---------------------------------------------------------------
>
> Key: FLINK-10751
> URL: https://issues.apache.org/jira/browse/FLINK-10751
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.6.2, 1.7.0
> Reporter: Ufuk Celebi
> Assignee: Ufuk Celebi
> Priority: Minor
>
> {{CheckpointProperties}} defines in which terminal job statuses a checkpoint
> should be disposed of.
> I've noticed that the properties {{CHECKPOINT_NEVER_RETAINED}} and
> {{CHECKPOINT_RETAINED_ON_FAILURE}} prescribe checkpoint disposal in the (locally)
> terminal job status {{SUSPENDED}}.
> Since a job reaches the {{SUSPENDED}} state when its {{JobMaster}} loses
> leadership, this would cause the checkpoint to be cleaned up, so it would no
> longer be available for recovery by the new leader. Therefore, we should
> retain checkpoints when reaching job status {{SUSPENDED}}.
> *BUT:* Because we special-case this terminal state in the only highly
> available {{CompletedCheckpointStore}} implementation (see
> [ZooKeeperCompletedCheckpointStore|https://github.com/apache/flink/blob/e7ac3ba/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L315])
> and don't use regular checkpoint disposal, this issue has not surfaced yet.
> I think we should proactively fix the properties so that they retain
> checkpoints in the {{SUSPENDED}} state. We might then be able to remove this
> special case completely since, with this change, all properties will retain
> checkpoints on suspension.
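
The proposed change can be illustrated with a minimal, standalone Java sketch. It does not use Flink's actual {{CheckpointProperties}} class: the names CheckpointRetentionSketch, RetentionPolicy, shouldDiscard and retainOnSuspension are hypothetical, and the behaviour for statuses other than SUSPENDED is an assumption inferred from the property names.

```java
import java.util.EnumSet;

public class CheckpointRetentionSketch {

    /** Simplified stand-in for the job statuses discussed in the issue. */
    public enum JobStatus { FINISHED, CANCELED, FAILED, SUSPENDED }

    /** Hypothetical policy type: the job statuses in which a checkpoint is discarded. */
    public static final class RetentionPolicy {
        private final EnumSet<JobStatus> discardOn;

        RetentionPolicy(EnumSet<JobStatus> discardOn) {
            // Defensive copy so later changes to the argument do not leak in.
            this.discardOn = EnumSet.copyOf(discardOn);
        }

        public boolean shouldDiscard(JobStatus status) {
            return discardOn.contains(status);
        }

        /** Returns a copy of this policy that keeps checkpoints when the job is suspended. */
        public RetentionPolicy retainOnSuspension() {
            EnumSet<JobStatus> adjusted = EnumSet.copyOf(discardOn);
            adjusted.remove(JobStatus.SUSPENDED);
            return new RetentionPolicy(adjusted);
        }
    }

    /** Assumed model of the reported behaviour: discard in every status, SUSPENDED included. */
    public static RetentionPolicy neverRetained() {
        return new RetentionPolicy(EnumSet.allOf(JobStatus.class));
    }

    /** Assumed model of the reported behaviour: keep on FAILED only, discard otherwise. */
    public static RetentionPolicy retainedOnFailure() {
        return new RetentionPolicy(
                EnumSet.of(JobStatus.FINISHED, JobStatus.CANCELED, JobStatus.SUSPENDED));
    }

    public static void main(String[] args) {
        RetentionPolicy reported = neverRetained();
        RetentionPolicy proposed = reported.retainOnSuspension();

        // A JobMaster losing leadership moves the job to SUSPENDED.
        System.out.println("reported policy discards on SUSPENDED: "
                + reported.shouldDiscard(JobStatus.SUSPENDED));
        System.out.println("proposed policy discards on SUSPENDED: "
                + proposed.shouldDiscard(JobStatus.SUSPENDED));
    }
}
```

In this model, applying retainOnSuspension() to both policies means no policy discards on SUSPENDED any more, which mirrors the observation above that a separate special case for suspension might become unnecessary.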
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)