[
https://issues.apache.org/jira/browse/FLINK-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214255#comment-16214255
]
ASF GitHub Bot commented on FLINK-7783:
---------------------------------------
GitHub user aljoscha opened a pull request:
https://github.com/apache/flink/pull/4879
[FLINK-7783] Don't always remove checkpoints in
ZooKeeperCompletedCheckpointStore#recover()
I think this will be the final version for what I started in #4863.
Now, the code will retrieve checkpoints and succeed if either all of them
are read or if two successive tries read the same set of checkpoints.
This doesn't duplicate the test anymore but still leaves the questionable
(lack of) separation of concerns in the store.
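
The retry rule described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in the pull request: the class `RecoverSketch`, the methods `readAll` and `recover`, the `reader` function, and the `maxAttempts` bound are all hypothetical names and assumptions introduced here. Unreadable handles are skipped rather than deleted, and recovery succeeds once either every handle was read or two successive tries produced the same result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch; none of these names are Flink API.
final class RecoverSketch {

    // One pass over all handles. A handle that cannot be read yields
    // Optional.empty() and is skipped; nothing is deleted here.
    static <H, C> List<C> readAll(List<H> handles, Function<H, Optional<C>> reader) {
        List<C> result = new ArrayList<>();
        for (H handle : handles) {
            reader.apply(handle).ifPresent(result::add);
        }
        return result;
    }

    // Repeats the pass until all handles were read, or until two
    // successive passes returned the same list of checkpoints.
    // maxAttempts is an assumption added so the sketch always terminates.
    static <H, C> List<C> recover(
            List<H> handles, Function<H, Optional<C>> reader, int maxAttempts) {
        List<C> previous = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            List<C> current = readAll(handles, reader);
            boolean allRead = current.size() == handles.size();
            boolean stable = current.equals(previous);
            if (allRead || stable) {
                return current;
            }
            previous = current;
        }
        throw new IllegalStateException(
                "Could not read a stable set of checkpoints in " + maxAttempts + " attempts");
    }
}
```

With a reader that permanently fails for one of three handles, the first pass returns two checkpoints, the second pass returns the same two, and `recover` returns them instead of discarding everything.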
R: @StefanRRichter, @tillrohrmann
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aljoscha/flink jira-7783-zookeeper-state-store-fix-simplified3
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/4879.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4879
----
commit ca189b4c44810229331332e397523cba5417b4d6
Author: Aljoscha Krettek <[email protected]>
Date: 2017-10-22T09:40:43Z
[FLINK-7783] Don't always remove checkpoints in
ZooKeeperCompletedCheckpointStore#recover()
----
> Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()
> ------------------------------------------------------------------------------
>
> Key: FLINK-7783
> URL: https://issues.apache.org/jira/browse/FLINK-7783
> Project: Flink
> Issue Type: Sub-task
> Components: State Backends, Checkpointing
> Affects Versions: 1.4.0, 1.3.2
> Reporter: Aljoscha Krettek
> Assignee: Aljoscha Krettek
> Priority: Blocker
> Fix For: 1.4.0, 1.3.3
>
>
> Currently, we always delete checkpoint handles if they (or the data from the
> DFS) cannot be read:
> https://github.com/apache/flink/blob/91a4b276171afb760bfff9ccf30593e648e91dfb/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L180
> This can lead to problems if the DFS is temporarily unavailable, i.e.
> we could inadvertently delete all checkpoints even though they are
> still valid.
> A user reported this problem on the mailing list:
> https://lists.apache.org/thread.html/9dc9b719cf8449067ad01114fedb75d1beac7b4dff171acdcc24903d@%3Cuser.flink.apache.org%3E