[ https://issues.apache.org/jira/browse/FLINK-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212398#comment-16212398 ]
ASF GitHub Bot commented on FLINK-7783:
---------------------------------------
GitHub user aljoscha opened a pull request:
https://github.com/apache/flink/pull/4870
[FLINK-7783] Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()
Alternative version of #4863.
This one actually works. #4863 does not work because I was deserialising
checkpoints on demand. That is problematic because checkpoints were
previously registered at the `SharedStateRegistry`. If we deserialise a
checkpoint on demand and call dispose on it (as #4863 does), this can remove
shared state handles that are still needed by other handles.
This version also fails as soon as one handle cannot be read. If we don't
do this, we will break other incremental state handles because we drop their
shared state handles.
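As a rough illustration of the fail-fast behaviour described above, here is a minimal sketch. It is a toy model, not the code in this pull request: `FailFastRecoverySketch`, `Handle` and `recover` are hypothetical names, and a plain map stands in for the ZooKeeper-backed store. The only point it shows is that recovery aborts on the first unreadable handle and leaves every handle in place.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these names are Flink APIs. */
public class FailFastRecoverySketch {

    /** Hypothetical stand-in for a handle whose checkpoint data lives on the DFS. */
    interface Handle {
        String read() throws IOException;
    }

    /**
     * Reads every handle in the store. On the first unreadable handle it throws
     * and leaves the store untouched, so no handle is removed or disposed.
     */
    static List<String> recover(Map<String, Handle> store) throws IOException {
        List<String> recovered = new ArrayList<>();
        for (Map.Entry<String, Handle> entry : store.entrySet()) {
            try {
                recovered.add(entry.getValue().read());
            } catch (IOException e) {
                // Fail the whole recovery instead of dropping this handle.
                throw new IOException("Could not read checkpoint handle " + entry.getKey(), e);
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        Map<String, Handle> store = new LinkedHashMap<>();
        store.put("chk-1", () -> "state-1");
        store.put("chk-2", () -> { throw new IOException("DFS temporarily unavailable"); });
        store.put("chk-3", () -> "state-3");

        try {
            recover(store);
        } catch (IOException e) {
            System.out.println("recovery failed: " + e.getMessage());
        }
        // All three handles are still present and can be retried later.
        System.out.println("handles still in store: " + store.keySet());
    }
}
```

Because nothing is removed on failure, a later recovery attempt sees the same set of handles once the DFS is reachable again.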
R: @StefanRRichter
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aljoscha/flink jira-7783-zookeeper-state-store-fix-simplified2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/4870.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4870
----
commit 4702bdd96a3baa844850bba47610c2a71ca7f2f1
Author: Aljoscha Krettek
Date: 2017-10-19T19:26:20Z
[FLINK-7783] Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()
----
> Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()
> ------------------------------------------------------------------------------
>
> Key: FLINK-7783
> URL: https://issues.apache.org/jira/browse/FLINK-7783
> Project: Flink
> Issue Type: Sub-task
> Components: State Backends, Checkpointing
> Affects Versions: 1.4.0, 1.3.2
> Reporter: Aljoscha Krettek
> Assignee: Aljoscha Krettek
> Priority: Blocker
> Fix For: 1.4.0, 1.3.3
>
>
> Currently, we always delete checkpoint handles if they (or the data from the
> DFS) cannot be read:
> https://github.com/apache/flink/blob/91a4b276171afb760bfff9ccf30593e648e91dfb/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L180
> This can lead to problems if the DFS is temporarily unavailable: we could
> inadvertently delete all checkpoints even though they are still valid.
> A user reported this problem on the mailing list:
> https://lists.apache.org/thread.html/9dc9b719cf8449067ad01114fedb75d1beac7b4dff171acdcc24903d@%3Cuser.flink.apache.org%3E
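The failure mode described in the issue can be reproduced in a small toy model. This is a sketch under stated assumptions, not the code at the linked line: `RemoveOnFailureSketch`, `Handle` and `recoverAndRemoveUnreadable` are hypothetical names, and a plain map stands in for the store. It only shows that removing a handle whenever a read fails empties the store during a temporary outage.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these names are Flink APIs. */
public class RemoveOnFailureSketch {

    /** Hypothetical stand-in for a handle whose checkpoint data lives on the DFS. */
    interface Handle {
        String read() throws IOException;
    }

    /** Models the problematic behaviour: any handle that cannot be read is removed. */
    static List<String> recoverAndRemoveUnreadable(Map<String, Handle> store) {
        List<String> recovered = new ArrayList<>();
        Iterator<Map.Entry<String, Handle>> it = store.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Handle> entry = it.next();
            try {
                recovered.add(entry.getValue().read());
            } catch (IOException e) {
                // The handle is dropped even though the data may still be valid.
                it.remove();
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        // Simulate a DFS outage: every read fails, although the data is intact.
        Handle unavailable = () -> { throw new IOException("DFS temporarily unavailable"); };
        Map<String, Handle> store = new LinkedHashMap<>();
        store.put("chk-1", unavailable);
        store.put("chk-2", unavailable);

        List<String> recovered = recoverAndRemoveUnreadable(store);
        System.out.println("recovered: " + recovered.size());
        System.out.println("handles left in store: " + store.size());
    }
}
```

After the call the store is empty, so nothing can be recovered once the DFS comes back.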