[
https://issues.apache.org/jira/browse/FLINK-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228330#comment-17228330
]
Yang Wang commented on FLINK-20045:
-----------------------------------
I think the root cause is that leadership was granted too quickly, before
{{TestingLeaderElectionEventHandler#init()}} had been called, which is why the
following exception appears in the Maven log. This cannot happen in production
code because {{DefaultLeaderElectionService}} guards this with a {{lock}}.
How can we fix the unstable test?
I suggest adding a "wait-with-timeout" to
{{TestingLeaderElectionEventHandler}} so that there is enough time to create the
{{LeaderElectionDriver}} and then {{init}} the
{{TestingLeaderElectionEventHandler}}.
{code:java}
10:30:37,419 [Curator-LeaderLatch-0] WARN org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths [] - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
10:30:37,419 [Curator-LeaderLatch-0] WARN org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths [] - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
10:30:37,468 [ main-EventThread] ERROR org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer [] - Listener (ZooKeeperLeaderElectionDriver{leaderPath='/leader'}) threw an exception
org.apache.flink.util.FlinkRuntimeException: init() should be called first.
    at org.apache.flink.runtime.leaderelection.TestingLeaderElectionEventHandler.onGrantLeadership(TestingLeaderElectionEventHandler.java:46) ~[test-classes/:?]
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.isLeader(ZooKeeperLeaderElectionDriver.java:158) ~[classes/:?]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:693) ~[flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:689) ~[flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:688) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:567) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.access$700(LeaderLatch.java:65) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:618) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:601) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [flink-shaded-zookeeper-3-3.4.14-12.0.jar:3.4.14-12.0]
10:33:58,379 [ main] INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver [] - Closing ZooKeeperLeaderElectionDriver{leaderPath='/leader'}
10:33:58,383 [ main] INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader'}.
10:33:58,385 [ Curator-Framework-0] INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - backgroundOperationsLoop exiting
10:33:58,388 [ main] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper [] - Session: 0x102de7dc7fc0000 closed
10:33:58,389 [ main-EventThread] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - EventThread shut down for session: 0x102de7dc7fc0000
10:33:58,392 [ main] ERROR org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest [] -
{code}
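The "wait-with-timeout" idea suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions: {{WaitingEventHandler}}, its {{timeoutMillis}} parameter, and {{isLeader()}} are hypothetical names, not the actual {{TestingLeaderElectionEventHandler}} code, and a plain {{CountDownLatch}} stands in for whatever synchronization the real fix might use. The callback blocks until {{init()}} has run instead of failing immediately.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class WaitWithTimeoutSketch {

    /** Hypothetical stand-in for the test handler; not the actual Flink class. */
    static class WaitingEventHandler {
        // Released once init() has been called.
        private final CountDownLatch initialized = new CountDownLatch(1);
        private final long timeoutMillis;
        private volatile boolean leader;

        WaitingEventHandler(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }

        /** Called by the test after the driver has been created. */
        void init() {
            initialized.countDown();
        }

        /** Called from the driver's callback thread; waits for init() first. */
        void onGrantLeadership() {
            awaitInit();
            leader = true;
        }

        boolean isLeader() {
            return leader;
        }

        private void awaitInit() {
            try {
                // Fail only if init() has still not happened after the timeout.
                if (!initialized.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    throw new IllegalStateException(
                            "init() was not called within " + timeoutMillis + " ms");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for init()", e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        WaitingEventHandler handler = new WaitingEventHandler(5_000);

        // Simulate leadership being granted before the test has called init().
        Thread callback = new Thread(handler::onGrantLeadership, "driver-callback");
        callback.start();

        Thread.sleep(100); // the test is still busy creating the driver
        handler.init();
        callback.join();

        System.out.println("leader = " + handler.isLeader()); // prints "leader = true"
    }
}
```

With a handler like this, an early grant is delayed rather than rejected, so the test no longer depends on {{init()}} winning the race against the callback thread. The timeout keeps a test that never calls {{init()}} from hanging indefinitely.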
> ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval failed with
> "TimeoutException: Contender was not elected as the leader within 200000ms"
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-20045
> URL: https://issues.apache.org/jira/browse/FLINK-20045
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Dian Fu
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9251&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> 2020-11-07T10:34:07.5063203Z [ERROR] testZooKeeperLeaderElectionRetrieval(org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest) Time elapsed: 202.445 s <<< ERROR!
> 2020-11-07T10:34:07.5064331Z java.util.concurrent.TimeoutException: Contender was not elected as the leader within 200000ms
> 2020-11-07T10:34:07.5064946Z at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:153)
> 2020-11-07T10:34:07.5065762Z at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:139)
> 2020-11-07T10:34:07.5066565Z at org.apache.flink.runtime.leaderelection.TestingLeaderBase.waitForLeader(TestingLeaderBase.java:48)
> 2020-11-07T10:34:07.5067185Z at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionTest.testZooKeeperLeaderElectionRetrieval(ZooKeeperLeaderElectionTest.java:144)
> {code}