[
https://issues.apache.org/jira/browse/FLINK-25486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484182#comment-17484182
]
Till Rohrmann edited comment on FLINK-25486 at 1/30/22, 4:23 PM:
-----------------------------------------------------------------
Fixed via
1.15.0: 8ba13f37afb9164f3bb17de78c4b0d85b1633638
1.14.4: 0b519c24222f61306f9faba2389c0958daa9cc0a
1.13.6: fe5a1718368e62eb7ac47c00aabbd94173dae668
was (Author: till.rohrmann):
Fixed via
1.15.0: 8ba13f37afb9164f3bb17de78c4b0d85b1633638
1.14.4: (will be added)
1.13.6: (will be added)
> Perjob can not recover from checkpoint when zookeeper leader changes
> --------------------------------------------------------------------
>
> Key: FLINK-25486
> URL: https://issues.apache.org/jira/browse/FLINK-25486
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0, 1.13.5, 1.14.2
> Reporter: Liu
> Assignee: Liu
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.15.0, 1.13.6, 1.14.4
>
>
> When the config
> high-availability.zookeeper.client.tolerate-suspended-connections is left at
> its default value of false, the appMaster fails over as soon as the zk leader
> changes. In this case, the old appMaster cleans up all the zk info, so the new
> appMaster cannot recover from the latest checkpoint.
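>
> For reference, a minimal sketch of the user-side mitigation (the key string is
> the one quoted above; the helper class name is hypothetical, and only the
> standard Configuration API is assumed): tolerating suspended connections keeps
> the appMaster from failing over on a mere zk leader change, so the cleanup
> described below is never triggered.
> {code:java}
> import org.apache.flink.configuration.Configuration;
>
> public final class ToleranceConfigSketch {
>     public static Configuration withSuspensionTolerance() {
>         Configuration conf = new Configuration();
>         // With this set to true, a SUSPENDED ZooKeeper connection (e.g. during a
>         // leader change) is tolerated instead of being treated as lost leadership,
>         // so the appMaster does not fail over and no HA cleanup happens.
>         conf.setBoolean(
>                 "high-availability.zookeeper.client.tolerate-suspended-connections",
>                 true);
>         return conf;
>     }
> }
> {code}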
> The process is as follows:
> # Start a perJob application.
> # Kill zk's leader node, which causes the perJob to suspend.
> # In MiniDispatcher's method jobReachedTerminalState, the shutDownFuture is
> completed with UNKNOWN.
> # The future is propagated to ClusterEntrypoint, whose shutdown method is
> called with cleanupHaData set to true.
> # The zk data is cleaned up and the appMaster exits.
> # The new appMaster does not find any checkpoints to start from, and the state
> is lost.
> Since the job can recover automatically when the zk leader changes, it is
> reasonable to keep the zk info for the upcoming recovery.
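>
> A minimal sketch of that idea (not the actual patch; it assumes the pre-1.15
> HighAvailabilityServices interface and the existing ApplicationStatus enum,
> while the helper class and method are hypothetical): only clean up the HA data
> when the job really reached a terminal state, and keep it when the status is
> UNKNOWN, i.e. when the cluster was merely suspended by a zk leader change.
> {code:java}
> import org.apache.flink.runtime.clusterframework.ApplicationStatus;
> import org.apache.flink.runtime.highavailability.HighAvailabilityServices;
>
> final class HaCleanupSketch {
>     static void shutDown(HighAvailabilityServices haServices, ApplicationStatus status)
>             throws Exception {
>         if (status == ApplicationStatus.UNKNOWN) {
>             // The job is only suspended (e.g. zk leader change): close the HA
>             // services but keep checkpoint pointers and other zk data so the
>             // next appMaster can resume from the latest checkpoint.
>             haServices.close();
>         } else {
>             // The job finished, failed terminally or was canceled: the HA data
>             // is no longer needed and can be removed.
>             haServices.closeAndCleanupAllData();
>         }
>     }
> }
> {code}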
>