Ufuk Celebi created FLINK-10751:
-----------------------------------

             Summary: Checkpoints should be retained when job reaches suspended state
                 Key: FLINK-10751
                 URL: https://issues.apache.org/jira/browse/FLINK-10751
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.6.2
            Reporter: Ufuk Celebi
            Assignee: Ufuk Celebi


{{CheckpointProperties}} define in which terminal job status a checkpoint 
should be disposed.

I've noticed that the properties {{CHECKPOINT_NEVER_RETAINED}} and 
{{CHECKPOINT_RETAINED_ON_FAILURE}} prescribe checkpoint disposal in the 
(locally) terminal job status {{SUSPENDED}}.

Since a job reaches the {{SUSPENDED}} state when its {{JobMaster}} loses 
leadership, the checkpoint would be cleaned up and would not be available 
for recovery by the new leader. Therefore, we should retain checkpoints 
when reaching job status {{SUSPENDED}}.

*BUT:* Because we special-case this terminal state in the only highly available 
{{CompletedCheckpointStore}} implementation (see 
[ZooKeeperCompletedCheckpointStore|https://github.com/apache/flink/blob/e7ac3ba/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L315])
 and don't use regular checkpoint disposal there, this issue has not surfaced yet.

I think we should proactively fix the properties so that they retain 
checkpoints in the {{SUSPENDED}} state. We might even remove this case 
completely, since with this change all properties will retain checkpoints 
on suspension.
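
To make the proposal concrete, here is a minimal sketch of the retention rule 
described above. It is a simplified model, not Flink's actual 
{{CheckpointProperties}} class: the names {{RetentionSketch}}, {{Policy}}, 
{{discardsOn}} and {{withoutSuspended}} are hypothetical, and the disposal 
sets for statuses other than {{SUSPENDED}} are assumptions inferred from the 
property names.

```java
import java.util.EnumSet;
import java.util.Set;

/** Simplified model of the retention rules in this issue; NOT Flink's CheckpointProperties. */
final class RetentionSketch {

    /** Job statuses relevant to checkpoint disposal (illustrative subset). */
    enum Status { FINISHED, CANCELED, FAILED, SUSPENDED }

    /** Hypothetical policy: the set of job statuses in which a checkpoint is discarded. */
    static final class Policy {
        private final Set<Status> discardOn;

        Policy(Set<Status> discardOn) {
            // Defensive copy so later changes to the argument do not affect the policy.
            this.discardOn = EnumSet.copyOf(discardOn);
        }

        boolean discardsOn(Status status) {
            return discardOn.contains(status);
        }

        /** Proposed fix: the same policy, but checkpoints are retained on SUSPENDED. */
        Policy withoutSuspended() {
            Set<Status> copy = EnumSet.copyOf(discardOn);
            copy.remove(Status.SUSPENDED);
            return new Policy(copy);
        }
    }

    // Behaviour as reported: both properties discard on SUSPENDED.
    // The non-SUSPENDED entries are assumptions based on the property names.
    static final Policy NEVER_RETAINED_CURRENT =
            new Policy(EnumSet.allOf(Status.class));
    static final Policy RETAINED_ON_FAILURE_CURRENT =
            new Policy(EnumSet.of(Status.FINISHED, Status.CANCELED, Status.SUSPENDED));

    public static void main(String[] args) {
        Policy fixed = NEVER_RETAINED_CURRENT.withoutSuspended();
        // Before the fix a leadership loss would dispose the checkpoint.
        System.out.println(NEVER_RETAINED_CURRENT.discardsOn(Status.SUSPENDED)); // true
        // After the fix the new leader can still recover from it.
        System.out.println(fixed.discardsOn(Status.SUSPENDED)); // false
    }
}
```

If every policy ends up retaining on {{SUSPENDED}}, the corresponding entry 
carries no information, which is why removing the case altogether is an 
option.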



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
