[
https://issues.apache.org/jira/browse/FLINK-22494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334528#comment-17334528
]
Yang Wang commented on FLINK-22494:
-----------------------------------
[~mapohl] I agree that leaving such an orphaned checkpoint pointer in the
ZNode or ConfigMap is not good behavior.
However, the ZK/K8s client can fail with an exception even though the data
was actually written to the ZNode/ConfigMap. I am not sure how to
guarantee that the data on the DFS is discarded only when the write to the
ConfigMap/ZK has truly failed. Do you mean we need to check for the existence
of the ZNode / ConfigMap key before discarding the state on the DFS?
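
To make the question concrete, here is a minimal sketch of the "check before
discarding" idea. It is not the actual Flink code: PointerStore, StateOnDfs,
FlakyStore and addAndMaybeDiscard are hypothetical names made up for
illustration, and the sketch assumes the key did not exist before the write.
Note that the existence check can itself fail, in which case the sketch keeps
the state rather than guessing.

```java
import java.util.HashMap;
import java.util.Map;

final class SafeDiscardSketch {

    /** Hypothetical stand-in for the ZNode / ConfigMap backend (not a Flink interface). */
    interface PointerStore {
        void put(String key, String pointer) throws Exception;
        boolean exists(String key) throws Exception;
    }

    /** Hypothetical stand-in for the checkpoint data stored on the DFS. */
    interface StateOnDfs {
        void discard();
    }

    enum Outcome { WRITTEN, DISCARDED, KEPT }

    /**
     * Writes the pointer. If the write throws, the state is discarded only
     * when the key is confirmed to be absent; otherwise the state is kept.
     */
    static Outcome addAndMaybeDiscard(
            PointerStore store, String key, String pointer, StateOnDfs state) {
        try {
            store.put(key, pointer);
            return Outcome.WRITTEN;
        } catch (Exception writeFailure) {
            try {
                if (!store.exists(key)) {
                    // Confirmed: nothing points at the state, so it is safe to discard.
                    state.discard();
                    return Outcome.DISCARDED;
                }
            } catch (Exception checkFailure) {
                // Cannot tell whether the pointer was written; fall through and keep the state.
            }
            // The pointer exists (or may exist). Keep the state; a real
            // implementation would probably log a warning here.
            return Outcome.KEPT;
        }
    }

    /** In-memory fake whose put() always throws, optionally after the write has landed. */
    static final class FlakyStore implements PointerStore {
        private final Map<String, String> data = new HashMap<>();
        private final boolean writeLandsBeforeFailure;

        FlakyStore(boolean writeLandsBeforeFailure) {
            this.writeLandsBeforeFailure = writeLandsBeforeFailure;
        }

        @Override
        public void put(String key, String pointer) throws Exception {
            if (writeLandsBeforeFailure) {
                data.put(key, pointer);
            }
            throw new Exception("simulated client failure");
        }

        @Override
        public boolean exists(String key) {
            return data.containsKey(key);
        }
    }
}
```

With new FlakyStore(true) the write lands before the exception, so the state
is kept; with new FlakyStore(false) the key is absent and the state is
discarded. The check is still not atomic with the write, so this only narrows
the window and does not give a hard guarantee.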
> Avoid discarding checkpoints in case of failure
> -----------------------------------------------
>
> Key: FLINK-22494
> URL: https://issues.apache.org/jira/browse/FLINK-22494
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Coordination
> Affects Versions: 1.13.0, 1.14.0, 1.12.3
> Reporter: Matthias
> Priority: Critical
> Fix For: 1.14.0, 1.13.1, 1.12.4
>
>
> Both {{StateHandleStore}} implementations (i.e.
> [KubernetesStateHandleStore:157|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java#L157]
> and
> [ZooKeeperStateHandleStore:170|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L170])
> discard checkpoints if the checkpoint metadata wasn't written to the
> backend.
> This does not cover the cases where the data was actually written to the
> backend but the call failed anyway (e.g. due to network issues). In such a
> case, we might end up having a pointer in the backend pointing to a
> checkpoint that was discarded.
> Instead of discarding the checkpoint data in this case, we might want to keep
> it. Otherwise, we might run into exceptions when
> recovering from the checkpoint later on. We might want to add a warning for
> the user pointing to the possibly orphaned checkpoint data.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)