[jira] [Updated] (FLINK-22494) Avoid discarding checkpoints in case of failure

Matthias (Jira) Tue, 27 Apr 2021 09:32:04 -0700


     [ https://issues.apache.org/jira/browse/FLINK-22494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Matthias updated FLINK-22494:
-----------------------------
    Description: 
Both {{StateHandleStore}} implementations (i.e. 
[KubernetesStateHandleStore:157|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java#L157]
 and 
[ZooKeeperStateHandleStore:170|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L170])
 discard checkpoints if the checkpoint metadata wasn't written to the backend. 

This does not cover the case where the data was actually written to the 
backend but the call failed anyway (e.g. due to network issues). In such a 
case, the backend might end up holding a pointer to a checkpoint that was 
discarded.

Instead of discarding the checkpoint data when the outcome of the write is 
unknown, we might want to keep it. Otherwise, we might run into exceptions 
when recovering from the checkpoint later on. We might also want to warn the 
user and point to the possibly orphaned checkpoint data.

  was:
Both {{StateHandleStore}} implementations (i.e. 
[KubernetesStateHandleStore:157|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java#L157]
 and 
[ZooKeeperStateHandleStore:170|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L170])
 discard checkpoints if the checkpoint metadata wasn't written to the backend. 

This does not cover the case where the data was actually written to the 
backend but the call failed anyway (e.g. due to network issues). In such a 
case, the backend might end up holding a pointer to a checkpoint that was 
discarded.

Instead of discarding the checkpoint data when the outcome of the write is 
unknown, we might want to keep it. Otherwise, we might run into exceptions 
when recovering from the checkpoint later on.


> Avoid discarding checkpoints in case of failure
> -----------------------------------------------
>
>                 Key: FLINK-22494
>                 URL: https://issues.apache.org/jira/browse/FLINK-22494
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.13.0, 1.14.0, 1.12.3
>            Reporter: Matthias
>            Priority: Critical
>             Fix For: 1.14.0, 1.13.1, 1.12.4
>
>
> Both {{StateHandleStore}} implementations (i.e. 
> [KubernetesStateHandleStore:157|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java#L157]
>  and 
> [ZooKeeperStateHandleStore:170|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L170])
>  discard checkpoints if the checkpoint metadata wasn't written to the 
> backend. 
> This does not cover the case where the data was actually written to the 
> backend but the call failed anyway (e.g. due to network issues). In such a 
> case, the backend might end up holding a pointer to a checkpoint that was 
> discarded.
> Instead of discarding the checkpoint data when the outcome of the write is 
> unknown, we might want to keep it. Otherwise, we might run into exceptions 
> when recovering from the checkpoint later on. We might also want to warn 
> the user and point to the possibly orphaned checkpoint data.
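
The behaviour proposed in the description could be sketched roughly as 
follows. This is only an illustration under stated assumptions, not the code 
of either StateHandleStore implementation: the class, the two exception 
types, the two interfaces and the store() method are all hypothetical names 
invented for this example. The idea is to discard the checkpoint data only 
when the backend definitely did not store the pointer, and to keep the data 
and record a warning when the outcome of the write is unknown.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: every name below is hypothetical and not part of Flink's API. */
final class DiscardOnFailureSketch {

    /** Hypothetical: the backend definitely did not apply the write. */
    static final class WriteRejectedException extends Exception {
        WriteRejectedException(String message) {
            super(message);
        }
    }

    /** Hypothetical: the call failed, but the write may still have been applied. */
    static final class UncertainWriteException extends Exception {
        UncertainWriteException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the HA backend that stores the pointer. */
    interface PointerBackend {
        void writePointer(String key, String checkpointPath)
                throws WriteRejectedException, UncertainWriteException;
    }

    /** Hypothetical stand-in for the storage that holds the checkpoint data. */
    interface CheckpointStorage {
        void discard(String checkpointPath);
    }

    /** Warnings are collected in a list so the sketch needs no logging framework. */
    static final List<String> WARNINGS = new ArrayList<>();

    /** Returns true if the pointer was stored, false if the write failed or is uncertain. */
    static boolean store(
            PointerBackend backend,
            CheckpointStorage storage,
            String key,
            String checkpointPath) {
        try {
            backend.writePointer(key, checkpointPath);
            return true;
        } catch (WriteRejectedException e) {
            // No pointer can exist in the backend, so deleting the data is safe.
            storage.discard(checkpointPath);
            return false;
        } catch (UncertainWriteException e) {
            // A pointer may exist: keep the data and tell the user where it lives.
            WARNINGS.add(
                    "Checkpoint data at " + checkpointPath
                            + " was kept and may be orphaned: " + e.getMessage());
            return false;
        }
    }
}
```

The sketch assumes the backend client can tell a definite rejection apart 
from a failure with unknown outcome (such as a lost connection). Where it 
cannot, treating every failure as uncertain would be the conservative choice, 
at the cost of more orphaned data that has to be cleaned up manually.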



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
