Github user StefanRRichter commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4879#discussion_r146159867
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java ---
    @@ -163,22 +162,50 @@ public void recover() throws Exception {
     
                LOG.info("Found {} checkpoints in ZooKeeper.", numberOfInitialCheckpoints);
     
    -           for (Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String> checkpointStateHandle : initialCheckpoints) {
    +           // Try and read the state handles from storage. We try until we either successfully read
    +           // all of them or when we reach a stable state, i.e. when successfully read the same set
    +           // of checkpoints in two tries.
    --- End diff --
    
    Maybe we could enhance the comment to explain why this code is written
    this way (DFS outage), and maybe also include a word about incremental
    checkpoints, to help future maintainers and/or our own memories.
    Unfortunately, this also signals that this code has to consider lots of
    implicit side effects :-(
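
    For reference, a minimal sketch of the retry-until-stable pattern that the
    code comment in the diff describes. This is not the code from the PR:
    `StableRecoverySketch`, `readOnce`, and `recoverUntilStable` are
    hypothetical names, and a plain `Function` stands in for whatever retrieves
    a checkpoint from its state handle. It assumes a failed read surfaces as a
    runtime exception or a null result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; not part of Flink.
class StableRecoverySketch {

    /** One pass over all handles; keeps only the entries that could be read. */
    static <H, C> List<C> readOnce(List<H> handles, Function<H, C> reader) {
        List<C> readable = new ArrayList<>();
        for (H handle : handles) {
            try {
                C value = reader.apply(handle);
                if (value != null) {
                    readable.add(value);
                }
            } catch (RuntimeException e) {
                // Assumed failure mode: storage (e.g. the DFS) is temporarily
                // unavailable. Skip this handle; a later pass may succeed.
            }
        }
        return readable;
    }

    /**
     * Repeats the pass until either every handle was read, or two consecutive
     * passes returned the same result (the "stable state").
     */
    static <H, C> List<C> recoverUntilStable(List<H> handles, Function<H, C> reader) {
        List<C> previous = null;
        List<C> current = readOnce(handles, reader);
        while (current.size() != handles.size() && !current.equals(previous)) {
            previous = current;
            current = readOnce(handles, reader);
        }
        return current;
    }
}
```

    The stable-state check is what keeps a permanently unreadable handle from
    blocking recovery forever, while a short outage still gets a second
    attempt. Note that a sketch like this would not terminate if the readable
    set kept changing between passes, which is one of the implicit assumptions
    a comment in the real code could spell out.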

