[jira] [Commented] (FLINK-22494) Avoid discarding checkpoints in case of failure

Yang Wang (Jira) Wed, 28 Apr 2021 00:43:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-22494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334528#comment-17334528
 ]


Yang Wang commented on FLINK-22494:
-----------------------------------

[~mapohl] I agree with you that it is not a good behavior to have such orphaned 
checkpoint pointer in the ZNode or ConfigMap.

Given that it could happen the ZK/K8s client failed with exception but the data 
was actually written to the ZNode/ConfigMap. I am not sure about how to 
guarantee that the data on the DFS is only discarded when writing ConfigMap/ZK 
has true failure. Do you mean we need to check the existence of ZNode / 
ConfigMap key before discarding the state on DFS?

> Avoid discarding checkpoints in case of failure
> -----------------------------------------------
>
>                 Key: FLINK-22494
>                 URL: https://issues.apache.org/jira/browse/FLINK-22494
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.13.0, 1.14.0, 1.12.3
>            Reporter: Matthias
>            Priority: Critical
>             Fix For: 1.14.0, 1.13.1, 1.12.4
>
>
> Both {{StateHandleStore}} implementations (i.e. 
> [KubernetesStateHandleStore:157|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java#L157]
>  and 
> [ZooKeeperStateHandleStore:170|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L170])
>  discard checkpoints if the checkpoint metadata wasn't written to the 
> backend. 
> This does not cover the cases where the data was actually written to the 
> backend but the call failed anyway (e.g. due to network issues). In such a 
> case, we might end up having a pointer in the backend pointing to a 
> checkpoint that was discarded.
> Instead of discarding the checkpoint data in this case, we might want to keep 
> it for this specific use case. Otherwise, we might run into Exceptions when 
> recovering from the Checkpoint later on. We might want to add a warning to 
> the user pointing to the possibly orphaned checkpoint data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-22494) Avoid discarding checkpoints in case of failure

Reply via email to