[ https://issues.apache.org/jira/browse/FLINK-22494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335271#comment-17335271 ]
Till Rohrmann commented on FLINK-22494:
---------------------------------------
cc [~pnowojski]
> Avoid discarding checkpoints in case of failure
> -----------------------------------------------
>
> Key: FLINK-22494
> URL: https://issues.apache.org/jira/browse/FLINK-22494
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Coordination
> Affects Versions: 1.13.0, 1.14.0, 1.12.3
> Reporter: Matthias
> Priority: Critical
> Fix For: 1.14.0, 1.13.1, 1.12.4
>
>
> Both {{StateHandleStore}} implementations (i.e.
> [KubernetesStateHandleStore:157|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java#L157]
> and
> [ZooKeeperStateHandleStore:170|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L170])
> discard checkpoints if the checkpoint metadata wasn't written to the
> backend.
> This does not cover the case where the metadata was actually written to the
> backend but the call failed anyway (e.g. due to network issues). In that
> case, the backend may end up holding a pointer to a checkpoint that was
> discarded.
> Instead of discarding the checkpoint data in this case, we might want to
> keep it. Otherwise, we might run into exceptions when recovering from the
> checkpoint later on. We might also want to log a warning that points the
> user to the possibly orphaned checkpoint data.
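
The behaviour proposed in the description could look roughly like the following. This is only a sketch, assuming the store can tell a definite write failure apart from an ambiguous one. Every name in it (CheckpointStoreSketch, Backend, StateHandle, DefinitelyNotWrittenException, PossiblyWrittenException, addPointer) is hypothetical and not part of Flink; the real StateHandleStore implementations have different signatures.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: none of these types are Flink classes. */
public class CheckpointStoreSketch {

    /** Hypothetical: the backend reports that nothing was stored. */
    public static class DefinitelyNotWrittenException extends Exception {
        public DefinitelyNotWrittenException(String msg) { super(msg); }
    }

    /** Hypothetical: the call failed, but the write may have gone through. */
    public static class PossiblyWrittenException extends Exception {
        public PossiblyWrittenException(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for the ZooKeeper or Kubernetes backend. */
    public interface Backend {
        void putPointer(String key, String location)
                throws DefinitelyNotWrittenException, PossiblyWrittenException;
    }

    /** Hypothetical stand-in for the persisted checkpoint data. */
    public static class StateHandle {
        public final String location;
        public boolean discarded = false;
        public StateHandle(String location) { this.location = location; }
        public void discard() { discarded = true; }
    }

    /** Collected instead of logged so the sketch has no logging dependency. */
    public static final List<String> WARNINGS = new ArrayList<>();

    /** Returns true if the pointer was stored; discards data only when that is safe. */
    public static boolean addPointer(Backend backend, String key, StateHandle handle) {
        try {
            backend.putPointer(key, handle.location);
            return true;
        } catch (DefinitelyNotWrittenException e) {
            // No pointer can exist, so the data is unreachable and safe to delete.
            handle.discard();
            return false;
        } catch (PossiblyWrittenException e) {
            // A pointer may exist: keep the data and tell the user where it is.
            WARNINGS.add("Possibly orphaned checkpoint data at " + handle.location
                    + " (key " + key + "): " + e.getMessage());
            return false;
        }
    }
}
```

The design choice here is to accept some leftover data rather than risk a dangling pointer: in the ambiguous case the checkpoint data survives, and the warning gives the user enough information to clean it up manually if the pointer turns out not to exist.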