[ https://issues.apache.org/jira/browse/FLINK-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214579#comment-16214579 ]

ASF GitHub Bot commented on FLINK-7783:
---------------------------------------

Github user StefanRRichter commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4879#discussion_r146159867
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java ---
    @@ -163,22 +162,50 @@ public void recover() throws Exception {
     
                LOG.info("Found {} checkpoints in ZooKeeper.", numberOfInitialCheckpoints);
     
    -           for (Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String> checkpointStateHandle : initialCheckpoints) {
    +           // Try and read the state handles from storage. We try until we either successfully read
    +           // all of them or when we reach a stable state, i.e. when successfully read the same set
    +           // of checkpoints in two tries.
    --- End diff --
    
    Maybe we could enhance the comment to explain why this code is written 
this way (DFS outage), and also include a word about incremental checkpoints, 
to help future maintainers and/or our own memories. Unfortunately, this also 
signals that this code has to consider lots of implicit side effects :-(
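
For readers without the full diff at hand, the retry-until-stable idea described in the code comment could be sketched roughly as follows. This is only an illustration, not the code from the pull request: the class `StableRecoverySketch`, the method `recover`, and the `reader` function are hypothetical names, and a failed read (for example during a DFS outage) is modelled as an empty `Optional` instead of an exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch of the retry logic; not Flink's actual implementation. */
final class StableRecoverySketch {

    /**
     * Tries to read every handle, and repeats the whole pass until either all
     * handles were read or two consecutive passes returned the same checkpoints.
     * Handles that cannot be read are skipped, never deleted.
     * Assumption: the reader returns Optional.empty() when a handle is unreadable.
     * Note that this sketch has no upper bound on the number of passes.
     */
    static <H, C> List<C> recover(List<H> handles, Function<H, Optional<C>> reader) {
        List<C> previous = null; // result of the previous pass, none yet
        while (true) {
            List<C> current = new ArrayList<>();
            for (H handle : handles) {
                reader.apply(handle).ifPresent(current::add);
            }
            boolean readAll = current.size() == handles.size();
            boolean stable = current.equals(previous);
            if (readAll || stable) {
                return current;
            }
            previous = current;
        }
    }
}
```

With a reader that fails for one handle only during the first pass, the second pass returns all checkpoints. With a reader that always fails for that handle, the second pass matches the first and the remaining checkpoints are returned, which is the "stable state" case.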


> Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-7783
>                 URL: https://issues.apache.org/jira/browse/FLINK-7783
>             Project: Flink
>          Issue Type: Sub-task
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.0, 1.3.2
>            Reporter: Aljoscha Krettek
>            Assignee: Aljoscha Krettek
>            Priority: Blocker
>             Fix For: 1.4.0, 1.3.3
>
>
> Currently, we always delete checkpoint handles if they (or the data from the 
> DFS) cannot be read: 
> https://github.com/apache/flink/blob/91a4b276171afb760bfff9ccf30593e648e91dfb/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L180
> This can lead to problems in case the DFS is temporarily not available, i.e. 
> we could inadvertently delete all checkpoints even though they are still valid.
> A user reported this problem on the mailing list: 
> https://lists.apache.org/thread.html/9dc9b719cf8449067ad01114fedb75d1beac7b4dff171acdcc24903d@%3Cuser.flink.apache.org%3E



