[jira] [Comment Edited] (FLINK-22494) Avoid discarding checkpoints in case of failure

Matthias (Jira) Tue, 27 Apr 2021 09:30:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-22494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333365#comment-17333365
 ]


Matthias edited comment on FLINK-22494 at 4/27/21, 4:29 PM:
------------------------------------------------------------

[~fly_in_gis]: I had a discussion with [~fabian.paul] and [~trohrmann] about 
this issue, and we concluded that leaving an orphaned checkpoint pointer in 
the ConfigMap isn't the best solution. Hence, I created this ticket to handle 
this case.

We might simply remove the discard in the failure path, since a failed call 
does not tell us whether the data was actually written to the backend.


was (Author: mapohl):
CC [~fly_in_gis]: We came to the conclusion that having an orphaned Checkpoint 
pointer in the ConfigMap isn't the best solution. Hence, we would prefer having 
orphaned checkpoint data over inconsistent state in Flink.

> Avoid discarding checkpoints in case of failure
> -----------------------------------------------
>
>                 Key: FLINK-22494
>                 URL: https://issues.apache.org/jira/browse/FLINK-22494
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.13.0, 1.14.0, 1.12.3
>            Reporter: Matthias
>            Priority: Critical
>             Fix For: 1.14.0, 1.13.1, 1.12.4
>
>
> Both {{StateHandleStore}} implementations (i.e. 
> [KubernetesStateHandleStore:157|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java#L157]
>  and 
> [ZooKeeperStateHandleStore:170|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L170])
>  discard checkpoints if the checkpoint metadata wasn't written to the 
> backend. 
> This does not cover the cases where the data was actually written to the 
> backend but the call failed anyway (e.g. due to network issues). In such a 
> case, we might end up having a pointer in the backend pointing to a 
> checkpoint that was discarded.
> Instead of discarding the checkpoint data in this case, we might want to keep 
> it for this specific use case. Otherwise, we might run into Exceptions when 
> recovering from the Checkpoint later on.
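
The behaviour proposed in the description can be sketched roughly as follows. 
This is only an illustration under stated assumptions, not the Flink code: 
every class, interface and method name below (CheckpointStoreSketch, 
PointerBackend, DefinitelyNotWrittenException, UnknownOutcomeException, 
addPointer) is hypothetical and made up for this example. The sketch assumes 
the backend client can tell a definite rejection apart from a call whose 
outcome is unknown, and it discards the checkpoint data only in the first case.

```java
/** Hypothetical sketch; none of these types are Flink classes. */
final class CheckpointStoreSketch {

    /** Assumed signal: the backend definitely did not store the pointer. */
    public static class DefinitelyNotWrittenException extends Exception {
        public DefinitelyNotWrittenException(String message) { super(message); }
    }

    /** Assumed signal: the call failed, but the pointer may have been stored. */
    public static class UnknownOutcomeException extends Exception {
        public UnknownOutcomeException(String message) { super(message); }
    }

    /** Stand-in for the checkpoint data that the pointer refers to. */
    public static class StateHandle {
        public boolean discarded = false;
        public void discard() { discarded = true; }
    }

    /** Stand-in for the HA backend (e.g. a ConfigMap or a ZooKeeper node). */
    public interface PointerBackend {
        void writePointer(String key, StateHandle handle)
                throws DefinitelyNotWrittenException, UnknownOutcomeException;
    }

    /**
     * Writes a pointer to the checkpoint into the backend. The checkpoint data
     * is discarded only if the backend definitely holds no pointer to it.
     */
    public static void addPointer(PointerBackend backend, String key, StateHandle handle)
            throws DefinitelyNotWrittenException, UnknownOutcomeException {
        try {
            backend.writePointer(key, handle);
        } catch (DefinitelyNotWrittenException e) {
            // No pointer exists, so cleaning up the data cannot break recovery.
            handle.discard();
            throw e;
        } catch (UnknownOutcomeException e) {
            // A pointer may exist. Keep the data: orphaned checkpoint data is
            // preferable to a pointer that refers to a discarded checkpoint.
            throw e;
        }
    }
}
```

The trade-off in this sketch is that checkpoint data may be left orphaned when 
the write did in fact fail, which matches the preference for orphaned 
checkpoint data over inconsistent state expressed in the earlier version of 
the comment.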



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
