[jira] [Commented] (FLINK-22494) Avoid discarding checkpoints in case of failure

Till Rohrmann (Jira) Thu, 29 Apr 2021 02:10:11 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-22494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335271#comment-17335271
 ]


Till Rohrmann commented on FLINK-22494:
---------------------------------------

cc [~pnowojski]

> Avoid discarding checkpoints in case of failure
> -----------------------------------------------
>
>                 Key: FLINK-22494
>                 URL: https://issues.apache.org/jira/browse/FLINK-22494
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.13.0, 1.14.0, 1.12.3
>            Reporter: Matthias
>            Priority: Critical
>             Fix For: 1.14.0, 1.13.1, 1.12.4
>
>
> Both {{StateHandleStore}} implementations (i.e. 
> [KubernetesStateHandleStore:157|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java#L157]
>  and 
> [ZooKeeperStateHandleStore:170|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L170])
>  discard checkpoints if the checkpoint metadata wasn't written to the 
> backend. 
> This does not cover the case where the data was actually written to the 
> backend but the call failed anyway (e.g. due to network issues). In such a 
> case, the backend might end up holding a pointer to a checkpoint that was 
> discarded.
> Instead of discarding the checkpoint data in this case, we might want to 
> keep it. Otherwise, we might run into exceptions when recovering from the 
> checkpoint later on. We might also want to log a warning that points the 
> user to the possibly orphaned checkpoint data.
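
The behaviour proposed in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: every type and method name in it (CheckpointDiscardSketch, PointerBackend, DefiniteWriteFailure, AmbiguousWriteFailure, store) is hypothetical and is not taken from the actual StateHandleStore implementations. It also assumes the backend client can tell a write that was definitely rejected apart from a call that failed with an unknown outcome.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types are part of Flink's API. */
final class CheckpointDiscardSketch {

    /** The backend definitely did not store the pointer. */
    static class DefiniteWriteFailure extends Exception {
        DefiniteWriteFailure(String message) { super(message); }
    }

    /** The call failed, but the pointer may have been stored (e.g. lost response). */
    static class AmbiguousWriteFailure extends Exception {
        AmbiguousWriteFailure(String message) { super(message); }
    }

    /** Stand-in for the HA backend that holds the pointer to the checkpoint. */
    interface PointerBackend {
        void writePointer(String name, String location)
                throws DefiniteWriteFailure, AmbiguousWriteFailure;
    }

    /** Stand-in for the checkpoint data on durable storage. */
    static final class CheckpointData {
        final String location;
        boolean discarded = false;

        CheckpointData(String location) { this.location = location; }

        void discard() { discarded = true; }
    }

    enum Outcome { WRITTEN, DISCARDED, KEPT_POSSIBLY_ORPHANED }

    /** Collected warnings; a real implementation would use a logger instead. */
    final List<String> warnings = new ArrayList<>();

    Outcome store(PointerBackend backend, String name, CheckpointData data) {
        try {
            backend.writePointer(name, data.location);
            return Outcome.WRITTEN;
        } catch (DefiniteWriteFailure e) {
            // No pointer can exist, so the data can be cleaned up safely.
            data.discard();
            return Outcome.DISCARDED;
        } catch (AmbiguousWriteFailure e) {
            // A pointer may exist: keep the data and tell the user where it is.
            warnings.add("Possibly orphaned checkpoint data at " + data.location);
            return Outcome.KEPT_POSSIBLY_ORPHANED;
        }
    }
}
```

The design choice is that data is discarded only when no pointer can exist. In the ambiguous case the sketch accepts possibly orphaned data rather than risk a dangling pointer, and records a warning so the data can be cleaned up manually.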



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
