[
https://issues.apache.org/jira/browse/FLINK-25486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469226#comment-17469226
]
Till Rohrmann edited comment on FLINK-25486 at 1/5/22, 11:38 AM:
-----------------------------------------------------------------
Hi [~Jiangang], thanks for reporting this issue. I think this is indeed a bug
and should be fixed. The problem seems to be, as you described, that the
{{MiniDispatcher}} completes the {{shutDownFuture}} not only on globally
terminal states.
Do you want to work on it?
cc [~dmvk].
was (Author: till.rohrmann):
Hi [~Jiangang], thanks for reporting this issue. I think this is indeed a bug
and should be fixed. Do you want to work on it? How will you fix it?
cc [~dmvk].
> Perjob can not recover from checkpoint when zookeeper leader changes
> --------------------------------------------------------------------
>
> Key: FLINK-25486
> URL: https://issues.apache.org/jira/browse/FLINK-25486
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.15.0, 1.13.5, 1.14.2
> Reporter: Liu
> Priority: Critical
> Fix For: 1.15.0, 1.13.6, 1.14.3
>
>
> When the config option
> high-availability.zookeeper.client.tolerate-suspended-connections is left at
> its default of false, the appMaster fails over once the zk leader changes. In
> this case, the old appMaster cleans up all the zk info, and the new appMaster
> cannot recover from the latest checkpoint.
> The process is as follows:
> # Start a perJob application.
> # Kill zk's leader node, which causes the perJob to suspend.
> # In MiniDispatcher's function jobReachedTerminalState, shutDownFuture is
> set to UNKNOWN.
> # The future is passed to ClusterEntrypoint, and the method is called with
> cleanupHaData set to true.
> # Clean up zk data and exit.
> # The new appMaster does not find any checkpoint to start from, and the
> state is lost.
> Since the job can recover automatically when the zk leader changes, it is
> reasonable to keep zk info for the coming recovery.
>
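The direction discussed above (completing the shutdown future only on globally terminal states) can be sketched roughly as follows. This is a minimal, self-contained illustration and not Flink's actual code: JobState, ShutdownStatus and SketchDispatcher are hypothetical stand-ins, and the real MiniDispatcher carries more logic than this. The assumption is that a suspended job is not globally terminal, so the future stays incomplete and no cleanup of HA data is triggered.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for a job status; only the "globally terminal" flag matters here.
enum JobState {
    FINISHED(true),
    CANCELED(true),
    FAILED(true),
    SUSPENDED(false); // assumed to be what the job reaches when the HA connection is lost

    private final boolean globallyTerminal;

    JobState(boolean globallyTerminal) {
        this.globallyTerminal = globallyTerminal;
    }

    boolean isGloballyTerminal() {
        return globallyTerminal;
    }
}

// Hypothetical stand-in for the status handed to the cluster entrypoint.
enum ShutdownStatus { SUCCEEDED, CANCELED, FAILED }

// Hypothetical dispatcher sketch; not the real MiniDispatcher.
final class SketchDispatcher {
    private final CompletableFuture<ShutdownStatus> shutDownFuture = new CompletableFuture<>();

    CompletableFuture<ShutdownStatus> getShutDownFuture() {
        return shutDownFuture;
    }

    void jobReachedTerminalState(JobState state) {
        if (!state.isGloballyTerminal()) {
            // Locally terminal only (e.g. SUSPENDED): leave the future incomplete,
            // so nothing downstream cleans up HA data and a new master can recover.
            return;
        }
        // Globally terminal: the job is really done, so shutdown may proceed.
        shutDownFuture.complete(toShutdownStatus(state));
    }

    private static ShutdownStatus toShutdownStatus(JobState state) {
        switch (state) {
            case FINISHED:
                return ShutdownStatus.SUCCEEDED;
            case CANCELED:
                return ShutdownStatus.CANCELED;
            default:
                return ShutdownStatus.FAILED;
        }
    }
}
```

With this guard, a suspension caused by a zk leader change leaves the future pending, while FINISHED, CANCELED and FAILED still complete it and allow cleanup.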
--
This message was sent by Atlassian Jira
(v8.20.1#820001)