[
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331111#comment-17331111
]
Chen Qin commented on FLINK-10052:
----------------------------------
I put together a timeline for the "ResourceManager leader changed to new
address null" exception.
In this case, when Curator signaled the ZooKeeper SUSPENDED state, another code
path deregistered the task executor on another instance, resulting in a restart.
Basically, when the suspended message lands on container 1, container 2 reacts
in TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093) and
fails with an exception.
While it all points to the handling of the suspended message, this part doesn't
seem to directly touch the changed code path.
Here is the timeline of warnings/exceptions:
on container_e26_1617655625710_9571_01_000017
2021-04-23T13:57:37.290 - Connection to ZooKeeper suspended. Can no longer
retrieve the leader from ZooKeeper.
2021-04-23T13:57:37.304 - Connection to ZooKeeper suspended. Can no longer
retrieve the leader from ZooKeeper.
on container_e26_1617655625710_9571_01_000001
2021-04-23T13:57:37.333 - USER_EVENTS.spo_derived_event.SINK-stream_joiner ->
USER_EVENTS.spo_derived_event.SINK-late-event-tracker (32/270)
(c60dc612ec4d703d1bff646c3442193a) switched from RUNNING to FAILED on
container_e26_1617655625710_9571_01_000017 @
xenon-pii-dev-001-20191210-data-slave-dev-0a01fa8b.ec2.pin220.com
(dataPort=45229). org.apache.flink.util.FlinkException: ResourceManager leader
changed to new address null
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$800(TaskExecutor.java:173)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerLeaderListener.lambda$notifyLeaderAddress$0(TaskExecutor.java:1816)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
on container_e26_1617655625710_9571_01_000017
2021-04-23T13:57:38.465 - Connection to ZooKeeper lost. Can no longer retrieve
the leader from ZooKeeper.
2021-04-23T13:57:38.496 - Unable to reconnect to ZooKeeper service, session
0x1050b21fe3006a6 has expired
> Tolerate temporarily suspended ZooKeeper connections
> ----------------------------------------------------
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
> Reporter: Till Rohrmann
> Assignee: Zili Chen
> Priority: Major
> Labels: pull-request-available, stale-assigned
> Fix For: 1.13.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator
> recipe for leader election. The leader latch revokes leadership in case of a
> suspended ZooKeeper connection. This can be premature in case that the system
> can reconnect to ZooKeeper before its session expires. The effect of the lost
> leadership is that all jobs will be canceled and directly restarted after
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper
> connection, it would be better to wait until the ZooKeeper connection is
> LOST. That way we would allow the system to reconnect and not lose the
> leadership. This could be achievable by using Curator's {{LeaderSelector}}
> instead of the {{LeaderLatch}}.
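
The behavior requested in the description above (keep leadership while the
connection is only SUSPENDED, revoke it once the connection is LOST) can be
sketched as a small state handler. This is only an illustration under stated
assumptions, not Flink's or Curator's implementation: the class
SuspendTolerantLeadership, its ConnectionState enum, and the method
onStateChanged are hypothetical names made up for this sketch, and no real
ZooKeeper client is involved.

```java
import java.util.ArrayList;
import java.util.List;

public class SuspendTolerantLeadership {

    // Hypothetical stand-in for the connection states named in the issue.
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private boolean leader;
    private final List<String> log = new ArrayList<>();

    SuspendTolerantLeadership(boolean initiallyLeader) {
        this.leader = initiallyLeader;
    }

    // Reacts to a connection state change. Only LOST revokes leadership.
    void onStateChanged(ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // Tolerate the suspension: the session may still be alive.
                log.add("SUSPENDED: keeping leadership");
                break;
            case LOST:
                // The session is gone, so leadership can no longer be assumed.
                if (leader) {
                    leader = false;
                    log.add("LOST: revoking leadership");
                }
                break;
            default:
                log.add(newState + ": no leadership change");
        }
    }

    boolean isLeader() {
        return leader;
    }

    List<String> log() {
        return log;
    }

    public static void main(String[] args) {
        // Short outage: the connection comes back before the session expires.
        SuspendTolerantLeadership shortOutage = new SuspendTolerantLeadership(true);
        shortOutage.onStateChanged(ConnectionState.SUSPENDED);
        shortOutage.onStateChanged(ConnectionState.RECONNECTED);
        System.out.println("SUSPENDED then RECONNECTED, leader = " + shortOutage.isLeader());

        // Session expiry: the connection is reported as LOST.
        SuspendTolerantLeadership expired = new SuspendTolerantLeadership(true);
        expired.onStateChanged(ConnectionState.SUSPENDED);
        expired.onStateChanged(ConnectionState.LOST);
        System.out.println("SUSPENDED then LOST, leader = " + expired.isLeader());
    }
}
```

In this sketch a short outage leaves the leader flag untouched, so no jobs
would need to be canceled and restarted; only the LOST transition clears it.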