[jira] [Commented] (FLINK-18451) Flink HA on yarn may appear TaskManager double running when HA is restored

ming li (Jira) Fri, 31 Jul 2020 01:12:20 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168508#comment-17168508
 ]


ming li commented on FLINK-18451:
---------------------------------

Hi [~trohrmann], I understand that in both the at-least-once and the 
exactly-once scenario this double running should not be a problem (although 
some additional guarantees are required).
However, in our production environment we use a message middleware similar to 
Kafka. Unlike Kafka, its partitions can only be allocated by the server, 
according to the consumer group, and each partition can be assigned to only 
one consumer. In the double-running scenario, some consumers therefore cannot 
obtain partitions. New partitions can be allocated for consumption only after 
the original task fails. In the meantime, some data is consumed by the old 
consumer, and as a result data loss occurs.
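
To make the scenario above concrete, here is a minimal sketch in Python. It is not our middleware's API: `ExclusiveBroker`, `acquire`, `release`, `poll` and the task names are hypothetical, and the sketch assumes that the server keeps one read position per partition for the consumer group, so a recovered consumer cannot re-read what the old consumer already took.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExclusiveBroker:
    """Hypothetical broker: a partition belongs to at most one consumer of a group."""
    partitions: Dict[int, List[str]]
    owner: Dict[int, str] = field(default_factory=dict)     # partition -> consumer id
    position: Dict[int, int] = field(default_factory=dict)  # server-side read position

    def acquire(self, consumer: str, partition: int) -> bool:
        # Assignment is refused while another consumer still owns the partition.
        return self.owner.setdefault(partition, consumer) == consumer

    def release(self, consumer: str, partition: int) -> None:
        if self.owner.get(partition) == consumer:
            del self.owner[partition]

    def poll(self, consumer: str, partition: int) -> Optional[str]:
        # Only the owner may read; reading advances the shared server-side position.
        if self.owner.get(partition) != consumer:
            return None
        pos = self.position.get(partition, 0)
        log = self.partitions[partition]
        if pos >= len(log):
            return None
        self.position[partition] = pos + 1
        return log[pos]


def simulate() -> dict:
    broker = ExclusiveBroker(partitions={0: ["r0", "r1", "r2", "r3"]})

    # Before the failure: the old task owns partition 0 and reads r0,
    # which we assume is covered by the latest checkpoint.
    broker.acquire("old-task", 0)
    broker.poll("old-task", 0)

    # HA is restored quickly: the new task starts while the old one still runs.
    new_got_partition = broker.acquire("new-task", 0)

    # The old (double-running) task keeps consuming records.
    zombie_reads = [broker.poll("old-task", 0), broker.poll("old-task", 0)]

    # The old task finally fails; only now can the new task obtain the partition.
    broker.release("old-task", 0)
    acquired_later = broker.acquire("new-task", 0)
    new_reads = []
    while (record := broker.poll("new-task", 0)) is not None:
        new_reads.append(record)

    return {
        "new_got_partition_during_double_run": new_got_partition,
        "acquired_after_old_task_failed": acquired_later,
        "zombie_reads": zombie_reads,
        "new_reads": new_reads,
    }
```

Under these assumptions the records read by the old task after the checkpoint never reach the recovered job, which is the data loss described above.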

> Flink HA on yarn may appear TaskManager double running when HA is restored
> --------------------------------------------------------------------------
>
>                 Key: FLINK-18451
>                 URL: https://issues.apache.org/jira/browse/FLINK-18451
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.9.0
>            Reporter: ming li
>            Priority: Major
>              Labels: high-availability
>
> We found that when a NodeManager is lost, a new JobManager is restored 
> by Yarn's ResourceManager and registers itself as the leader in 
> Zookeeper. The original TaskManager discovers the new JobManager through 
> Zookeeper and closes its connection to the old JobManager. At this point, all 
> tasks of the TaskManager fail. The new JobManager then directly performs job 
> recovery from the latest checkpoint.
> However, if a TaskManager's connection to Zookeeper is abnormal during the 
> recovery process, it does not register with the new JobManager in time. 
> Before the following timeouts expire:
> 1. Connection with Zookeeper
> 2. Heartbeat with JobManager/ResourceManager
> the Task continues to run (assuming that a Task can run independently in 
> the TaskManager). If HA recovers fast enough, some Tasks will be running 
> twice at the same time.
> Do we need to keep a persistent record of the cluster resources allocated 
> at runtime, and use it to make sure that all Tasks have stopped when HA is 
> restored?
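
The timing condition in the description above can be sketched as simple arithmetic. This is only an illustration: the function and parameter names are hypothetical, the numbers are made up, and it assumes that the old Task stops as soon as the first of the two timeouts fires.

```python
def double_run_window(zk_timeout_s: float,
                      heartbeat_timeout_s: float,
                      ha_recovery_s: float) -> float:
    """Seconds during which an old and a recovered Task may run side by side.

    Assumption: the old Task keeps running until the earlier of the two
    timeouts expires, and the recovered Task starts after ha_recovery_s.
    """
    old_task_stops_at = min(zk_timeout_s, heartbeat_timeout_s)
    return max(0.0, old_task_stops_at - ha_recovery_s)


# Illustrative values only, not Flink defaults.
fast_recovery = double_run_window(zk_timeout_s=60, heartbeat_timeout_s=50, ha_recovery_s=20)
slow_recovery = double_run_window(zk_timeout_s=60, heartbeat_timeout_s=50, ha_recovery_s=70)
```

With these numbers a fast recovery leaves an overlap of 30 seconds, while a recovery slower than both timeouts leaves none.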



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
