[ https://issues.apache.org/jira/browse/FLINK-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901975#comment-15901975 ]

ASF GitHub Bot commented on FLINK-5940:
---------------------------------------

Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3446#discussion_r105021165
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java ---
    @@ -226,16 +200,43 @@ public CompletedCheckpoint getLatestCheckpoint() throws Exception {
                        return null;
                }
                else {
    -                   return checkpointStateHandles.getLast().f0.retrieveState();
    +                   while(!checkpointStateHandles.isEmpty()) {
    +                           Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String> checkpointStateHandle = checkpointStateHandles.peekLast();
    +
    +                           try {
    +                                   return retrieveCompletedCheckpoint(checkpointStateHandle);
    +                           } catch (FlinkException e) {
    --- End diff --
    
    I would catch more than `FlinkException` here - after all, we want to fall back to earlier checkpoints in any error case, no?
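
To make the suggestion concrete, here is a minimal, self-contained sketch of the fallback loop with a broad `catch`. It is not the code from the pull request: the class `CheckpointFallbackSketch`, the method `latestRetrievable`, and the use of `Callable` as a stand-in for a retrievable state handle are all hypothetical names chosen for illustration, and dropping the broken handle from the deque is an assumption based on the issue description.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Callable;

public class CheckpointFallbackSketch {

    /**
     * Returns the newest checkpoint that can be retrieved, or null if none can.
     * Each Callable is a hypothetical stand-in for a state handle; handles that
     * fail to load are removed from the deque before the next attempt.
     */
    public static <T> T latestRetrievable(Deque<Callable<T>> handles) {
        while (!handles.isEmpty()) {
            Callable<T> handle = handles.peekLast();
            try {
                return handle.call();
            } catch (Exception e) {
                // Any exception, not just one specific type, triggers the
                // fallback to the next older handle.
                handles.removeLast();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Deque<Callable<String>> handles = new ArrayDeque<>();
        handles.add(() -> "checkpoint-1");
        handles.add(() -> { throw new IllegalStateException("corrupt handle"); });
        handles.add(() -> { throw new java.io.IOException("missing file"); });

        // The two newest handles fail with different exception types,
        // so the oldest checkpoint is returned and one handle remains.
        System.out.println(latestRetrievable(handles));
        System.out.println(handles.size());
    }
}
```

In this sketch, catching `Exception` covers both checked and unchecked failures. Whether `Error` subclasses should also trigger the fallback is a separate decision that the sketch leaves open.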


> ZooKeeperCompletedCheckpointStore cannot handle broken state handles
> --------------------------------------------------------------------
>
>                 Key: FLINK-5940
>                 URL: https://issues.apache.org/jira/browse/FLINK-5940
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0, 1.1.4, 1.3.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> The {{ZooKeeperCompletedCheckpointStore}} reads a set of 
> {{RetrievableStateHandles}} from ZooKeeper upon recovery. It then tries to 
> retrieve the {{CompletedCheckpoint}} from the latest state handle. If the 
> retrieve operation fails, then the whole recovery of completed checkpoints 
> fails even though the store might have read older state handles from 
> ZooKeeper. 
> I propose to harden the behaviour by removing broken state handles and 
> returning the first successfully retrieved {{CompletedCheckpoint}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
