[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

Chen Qin (Jira) Fri, 23 Apr 2021 17:19:13 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331111#comment-17331111
 ]


Chen Qin commented on FLINK-10052:
----------------------------------

I put together a timeline for the "ResourceManager leader changed to new 
address null" exception. In this case, when Curator signaled the ZooKeeper 
SUSPENDED state, another code path deregistered the task executor on another 
instance, which resulted in a restart.

Basically, when the suspended message landed on container 1, container 2 
reacted with 
TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093) and 
threw an exception.

While it all points to the handling of the suspended message, this part 
doesn't seem to directly touch the changed code path.

 

Here is the timeline of warnings/exceptions:

on container_e26_1617655625710_9571_01_000017
 2021-04-23T13:57:37.290 - Connection to ZooKeeper suspended. Can no longer 
retrieve the leader from ZooKeeper.
 2021-04-23T13:57:37.304 - Connection to ZooKeeper suspended. Can no longer 
retrieve the leader from ZooKeeper.

on container_e26_1617655625710_9571_01_000001

2021-04-23T13:57:37.333 - USER_EVENTS.spo_derived_event.SINK-stream_joiner -> 
USER_EVENTS.spo_derived_event.SINK-late-event-tracker (32/270) 
(c60dc612ec4d703d1bff646c3442193a) switched from RUNNING to FAILED on 
container_e26_1617655625710_9571_01_000017 @ 
xenon-pii-dev-001-20191210-data-slave-dev-0a01fa8b.ec2.pin220.com 
(dataPort=45229). org.apache.flink.util.FlinkException: ResourceManager leader 
changed to new address null
 at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093)
 at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$800(TaskExecutor.java:173)
 at 
org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerLeaderListener.lambda$notifyLeaderAddress$0(TaskExecutor.java:1816)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)

on container_e26_1617655625710_9571_01_000017 

2021-04-23T13:57:38.465 - Connection to ZooKeeper lost. Can no longer retrieve 
the leader from ZooKeeper.
 2021-04-23T13:57:38.496 - Unable to reconnect to ZooKeeper service, session 
0x1050b21fe3006a6 has expired

> Tolerate temporarily suspended ZooKeeper connections
> ----------------------------------------------------
>
>                 Key: FLINK-10052
>                 URL: https://issues.apache.org/jira/browse/FLINK-10052
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>            Reporter: Till Rohrmann
>            Assignee: Zili Chen
>            Priority: Major
>              Labels: pull-request-available, stale-assigned
>             Fix For: 1.13.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.
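
The proposal quoted above (revoke leadership on LOST rather than on SUSPENDED)
can be illustrated with a minimal sketch. It is not Flink or Curator code: the
class SuspendTolerantLeadership and the enum ConnState are hypothetical
stand-ins, with state names borrowed from the ones discussed in the issue, and
re-election after a revoke is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a leader contender; not Flink or Curator code. */
class SuspendTolerantLeadership {

    /** Simplified connection states, named after those discussed in the issue. */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final boolean tolerateSuspended;
    private boolean leader = true; // assumption: leadership was already granted
    private final List<String> events = new ArrayList<>();

    SuspendTolerantLeadership(boolean tolerateSuspended) {
        this.tolerateSuspended = tolerateSuspended;
    }

    /** Reacts to a connection state change according to the chosen policy. */
    void onStateChanged(ConnState state) {
        switch (state) {
            case SUSPENDED:
                // Eager policy: give up leadership as soon as the connection is suspended.
                // Tolerant policy: keep leadership and wait for RECONNECTED or LOST.
                if (!tolerateSuspended) {
                    revoke("SUSPENDED");
                }
                break;
            case LOST:
                // The session is gone, so leadership is revoked under both policies.
                revoke("LOST");
                break;
            default:
                // CONNECTED / RECONNECTED: nothing to revoke in this sketch.
                break;
        }
    }

    private void revoke(String reason) {
        if (leader) {
            leader = false;
            events.add("revoked on " + reason);
        }
    }

    boolean isLeader() { return leader; }

    List<String> events() { return events; }

    public static void main(String[] args) {
        // A short connection blip: SUSPENDED followed by RECONNECTED.
        SuspendTolerantLeadership eager = new SuspendTolerantLeadership(false);
        SuspendTolerantLeadership tolerant = new SuspendTolerantLeadership(true);
        for (ConnState s : new ConnState[] {ConnState.SUSPENDED, ConnState.RECONNECTED}) {
            eager.onStateChanged(s);
            tolerant.onStateChanged(s);
        }
        System.out.println("eager leader after blip:    " + eager.isLeader());
        System.out.println("tolerant leader after blip: " + tolerant.isLeader());
    }
}
```

With the tolerant policy a short blip leaves leadership untouched, while a
sequence like the one in the timeline above (suspended, then lost with an
expired session) would still end in a revoke under either policy.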



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
