Github user StefanRRichter commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4879#discussion_r146159867

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java ---
    @@ -163,22 +162,50 @@ public void recover() throws Exception {

     LOG.info("Found {} checkpoints in ZooKeeper.", numberOfInitialCheckpoints);

    - for (Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String> checkpointStateHandle : initialCheckpoints) {
    + // Try and read the state handles from storage. We try until we either successfully read
    + // all of them or when we reach a stable state, i.e. when successfully read the same set
    + // of checkpoints in two tries.
    --- End diff --

    Maybe we could enhance the comment to explain why this code is written the way it is (DFS outage), and maybe also include a word about incremental checkpoints, to help future maintainers and/or our own memories. Unfortunately, this also signals that this code has to consider lots of implicit side effects :-(
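
    For illustration only, here is a minimal sketch of the "read until all succeed or the result is stable" loop that the quoted comment describes. It is not the code from the pull request: `StableReadSketch`, `readOnce`, and `readUntilStable` are hypothetical names, and plain `Callable`s stand in for the state handles. It assumes a failed read (for example during a DFS outage) is simply skipped and retried on the next pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical sketch of the retry-until-stable idea; not Flink's implementation. */
final class StableReadSketch {

    /** One read pass: keep whatever could be retrieved, skip whatever failed. */
    static <T> List<T> readOnce(List<Callable<T>> handles) {
        List<T> read = new ArrayList<>();
        for (Callable<T> handle : handles) {
            try {
                read.add(handle.call());
            } catch (Exception e) {
                // Assumption: a failure may be transient (e.g. storage unreachable),
                // so the handle is skipped here and tried again on the next pass.
            }
        }
        return read;
    }

    /**
     * Repeats read passes until either every handle was read, or two
     * consecutive passes returned the same result (a stable state).
     */
    static <T> List<T> readUntilStable(List<Callable<T>> handles) {
        List<T> previous = null;
        while (true) {
            List<T> current = readOnce(handles);
            if (current.size() == handles.size() || current.equals(previous)) {
                return current;
            }
            previous = current;
        }
    }
}
```

    With a handle that fails once and then succeeds, the loop returns both entries on the second pass. With a handle that always fails, it stops as soon as two passes agree and returns only the readable entries. A real implementation would probably also want a bound on the number of passes, which this sketch leaves out.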
---