[
https://issues.apache.org/jira/browse/FLINK-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214582#comment-16214582
]
ASF GitHub Bot commented on FLINK-7783:
---------------------------------------
Github user StefanRRichter commented on a diff in the pull request:
https://github.com/apache/flink/pull/4879#discussion_r146160129
--- Diff:
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java
---
@@ -163,22 +162,50 @@ public void recover() throws Exception {
--- End diff --
In line 156, I found code that accounts for concurrent modifications in the
`ZooKeeperStateHandleStore`. Just for discussion: what happens if a concurrent
modification occurs after we have retrieved `initialCheckpoints`? Couldn't it
happen that, after we obtain `initialCheckpoints` (maybe of size 1), the store
is modified, the only handle suddenly becomes invalid, and we can no longer
recover? Maybe the code was just written in an overprotective way and this is a
non-issue?
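
To make the scenario concrete, here is a minimal sketch of a recovery loop that works on a snapshot of the handles and only skips entries that cannot be read, without deleting them. It is not the code from the pull request and does not use Flink APIs: `RecoverSketch`, `Loader`, and `recover` are hypothetical names, and plain strings stand in for state handles and checkpoint data.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecoverSketch {

    /** Hypothetical stand-in for reading the checkpoint data behind a handle. */
    public interface Loader {
        String load(String handle) throws Exception;
    }

    /**
     * Tries to load every handle from a snapshot of the initial handle list.
     * A handle that cannot be read is skipped but NOT deleted, because the
     * failure may be transient (for example, the DFS is temporarily down).
     */
    public static List<String> recover(List<String> initialHandles, Loader loader) {
        // Copy the list so later changes to the caller's list do not affect the loop.
        List<String> snapshot = new ArrayList<>(initialHandles);
        List<String> recovered = new ArrayList<>();
        for (String handle : snapshot) {
            try {
                recovered.add(loader.load(handle));
            } catch (Exception e) {
                // Assumption of this sketch: log-and-skip; the handle stays in the store.
                System.err.println("Could not read " + handle + ": " + e.getMessage());
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        List<String> handles = new ArrayList<>(Arrays.asList("chk-1", "chk-2"));
        List<String> recovered = recover(handles, h -> {
            if (h.equals("chk-2")) {
                throw new IOException("DFS unavailable");
            }
            return "state-of-" + h;
        });
        System.out.println("recovered: " + recovered);
        System.out.println("handles still present: " + handles);
    }
}
```

With a single initial handle that turns out to be unreadable, `recover` returns an empty list while the handle itself is left untouched. A caller would then have to decide whether to retry or to fail, rather than silently start without state.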
> Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()
> ------------------------------------------------------------------------------
>
> Key: FLINK-7783
> URL: https://issues.apache.org/jira/browse/FLINK-7783
> Project: Flink
> Issue Type: Sub-task
> Components: State Backends, Checkpointing
> Affects Versions: 1.4.0, 1.3.2
> Reporter: Aljoscha Krettek
> Assignee: Aljoscha Krettek
> Priority: Blocker
> Fix For: 1.4.0, 1.3.3
>
>
> Currently, we always delete checkpoint handles if they (or the data from the
> DFS) cannot be read:
> https://github.com/apache/flink/blob/91a4b276171afb760bfff9ccf30593e648e91dfb/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L180
> This can lead to problems if the DFS is temporarily unavailable, i.e. we
> could inadvertently delete all checkpoints even though they are still valid.
> A user reported this problem on the mailing list:
> https://lists.apache.org/thread.html/9dc9b719cf8449067ad01114fedb75d1beac7b4dff171acdcc24903d@%3Cuser.flink.apache.org%3E