[
https://issues.apache.org/jira/browse/FLINK-18451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17174148#comment-17174148
]
Till Rohrmann edited comment on FLINK-18451 at 8/10/20, 7:38 AM:
-----------------------------------------------------------------
But what if the {{TaskExecutor}} has already stopped the tasks because of some
heartbeat timeout or a suspension of the ZooKeeper connection? Then it would
not report anything back to the {{JobMaster}}.
was (Author: till.rohrmann):
But what if the {{TaskExecutor}} has already stopped the tasks because of some
heartbeat timeout? Then it would not report anything back to the {{JobMaster}}.
> Flink HA on yarn may appear TaskManager double running when HA is restored
> --------------------------------------------------------------------------
>
> Key: FLINK-18451
> URL: https://issues.apache.org/jira/browse/FLINK-18451
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.9.0
> Reporter: ming li
> Priority: Major
> Labels: high-availability
>
> We found that when a NodeManager is lost, YARN's ResourceManager starts a new
> JobManager, which registers itself as the leader in ZooKeeper. The original
> TaskManagers discover the new JobManager through ZooKeeper and close their
> connection to the old JobManager. At that point, all tasks on those
> TaskManagers fail. The new JobManager then recovers the job directly from the
> latest checkpoint.
> However, if a TaskManager's connection to ZooKeeper is disrupted during
> recovery, it does not register with the new JobManager in time. Until one of
> the following timeouts fires:
> 1. The ZooKeeper connection timeout
> 2. The heartbeat timeout with the JobManager/ResourceManager
> its tasks continue to run (assuming the tasks can run independently within the
> TaskManager). If HA recovers fast enough, some tasks will run twice during
> this window.
> Do we need to keep a persistent record of the cluster resources allocated at
> runtime, and use it to verify that all tasks have stopped when HA is restored?
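
The timing argument in the description can be sketched as a small model. This is only an illustration, not Flink code: the names `Timeouts` and `double_run_window` are hypothetical, the numbers in the usage note are made up rather than Flink defaults, and the model assumes both timeouts start counting at the moment the old JobManager is lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeouts:
    """Hypothetical timeouts (seconds) after which an isolated TaskManager stops its tasks."""

    zookeeper_session: float
    heartbeat: float

    def task_stop_deadline(self) -> float:
        # Old tasks keep running until the first of the two timeouts fires.
        return min(self.zookeeper_session, self.heartbeat)


def double_run_window(timeouts: Timeouts, recovery_time: float) -> float:
    """Return how long (seconds) old and restored tasks may run side by side.

    recovery_time is the time from losing the old JobManager until the new
    JobManager has redeployed the job from the latest checkpoint.
    A result of 0.0 means the old tasks stopped before the new ones started.
    """
    return max(0.0, timeouts.task_stop_deadline() - recovery_time)
```

With illustrative values, `double_run_window(Timeouts(zookeeper_session=60, heartbeat=50), recovery_time=20)` gives an overlap of 30 seconds, while a recovery that takes 70 seconds gives no overlap at all. Under this model, double running is possible exactly when HA recovery finishes before the shorter of the two timeouts.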