[ https://issues.apache.org/jira/browse/FLINK-18451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171468#comment-17171468 ]
Till Rohrmann commented on FLINK-18451:
---------------------------------------
How do you know how many {{TaskManagers}} you have to wait for before you can
say that you have seen all task information?
> Flink HA on yarn may appear TaskManager double running when HA is restored
> --------------------------------------------------------------------------
>
> Key: FLINK-18451
> URL: https://issues.apache.org/jira/browse/FLINK-18451
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.9.0
> Reporter: ming li
> Priority: Major
> Labels: high-availability
>
> We found that when a NodeManager is lost, Yarn's ResourceManager restores a
> new JobManager, and the leader node is registered in Zookeeper. The original
> TaskManagers find the new JobManager through Zookeeper and close their
> connection to the old JobManager. At this point, all tasks on those
> TaskManagers fail. The new JobManager then immediately starts job recovery
> from the latest checkpoint.
> However, if a TaskManager's connection to Zookeeper is disrupted during the
> recovery process, it does not register with the new JobManager in time.
> Until the following timeouts expire:
> 1. Connection with Zookeeper
> 2. Heartbeat with JobManager/ResourceManager
> its Tasks will continue to run (assuming that a Task can run independently
> in the TaskManager). If HA recovers fast enough, some Tasks will run twice
> concurrently during this window.
> Do we need to keep a persistent record of the cluster resources allocated
> at runtime, and use it to verify that all Tasks have stopped when HA is
> restored?
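
One possible reading of the proposal above, as a minimal sketch only: the new
JobManager loads the persisted set of TaskManager IDs and holds back
redeployment until each of them has been confirmed stopped or a timeout has
passed. RecoveryGate and all of its methods are hypothetical names and not
part of Flink's API. The sketch also assumes the timeout is chosen to be at
least as long as the Zookeeper and heartbeat timeouts listed in the
description.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Flink's API. */
final class RecoveryGate {
    // TaskManagers from the persisted record that are not yet confirmed stopped.
    private final Set<String> pending;
    // Point in time after which the gate opens regardless of confirmations.
    private final long deadlineMillis;

    RecoveryGate(Set<String> persistedTaskManagerIds, long nowMillis, long timeoutMillis) {
        this.pending = new HashSet<>(persistedTaskManagerIds);
        this.deadlineMillis = nowMillis + timeoutMillis;
    }

    /** Call when a recorded TaskManager re-registers and reports its tasks as stopped. */
    void confirmStopped(String taskManagerId) {
        pending.remove(taskManagerId);
    }

    /** True once every recorded TaskManager is confirmed, or the timeout has expired. */
    boolean safeToRedeploy(long nowMillis) {
        return pending.isEmpty() || nowMillis >= deadlineMillis;
    }

    /** Snapshot of the TaskManagers that are still unaccounted for. */
    Set<String> unconfirmed() {
        return new HashSet<>(pending);
    }
}
```

In this sketch, the size of the persisted set is what tells the new JobManager
how many TaskManagers to wait for. The timeout only bounds the wait when a
recorded TaskManager never reports back.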