[ 
https://issues.apache.org/jira/browse/FLINK-26121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492574#comment-17492574
 ] 

Matthias Pohl commented on FLINK-26121:
---------------------------------------

I wasn't able to reproduce it locally. But it looks like a connection 
instability to the ZooKeeper server. The test behaves as expected until the we 
wait for the new leader to be picked up. The connection seems to be lost which 
triggers an additional connection state change (suspended -> reconnect) which 
adds another leader address change to the blocking queue of the test's 
{{QueueLeaderElectionListener}}. It looks like a race condition between the 
thread that is blocked in the queue's put operation and the close operation 
that interrupts all threads while shutting down the ExecutorServices.
{code}
[...]
21:43:16,346 [    main-EventThread] INFO  
org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager
 [] - State change: SUSPENDED
21:43:16,346 [Curator-ConnectionStateManager-0] WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - 
Connection to ZooKeeper suspended, waiting for reconnection.
21:43:18,194 [    main-EventThread] INFO  
org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager
 [] - State change: RECONNECTED
21:43:18,222 [    main-EventThread] DEBUG 
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - 
Leader node has changed.
21:43:21,353 [Curator-ConnectionStateManager-0] INFO  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - 
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
21:43:21,353 [Curator-ConnectionStateManager-0] DEBUG 
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - 
Leader node has changed.
21:43:21,453 [                main] INFO  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - 
Closing 
ZookeeperLeaderRetrievalDriver{connectionInformationPath='/testNewLeaderAfterReconnectTriggersListenerNotification/connection_info'}.
21:43:21,462 [ Curator-Framework-0] INFO  
org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl
 [] - backgroundOperationsLoop exiting
21:43:21,463 [Curator-ConnectionStateManager-0] ERROR 
org.apache.flink.runtime.util.TestingFatalErrorHandler       [] - OnFatalError:
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not 
handle node changed event.
{code}

> ZooKeeperLeaderRetrievalConnectionHandlingTest.testNewLeaderAfterReconnectTriggersListenerNotification
>  failed on azure
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26121
>                 URL: https://issues.apache.org/jira/browse/FLINK-26121
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Yun Gao
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: test-stability
>
> {code:java}
> 2022-02-11T21:43:35.4936452Z Feb 11 21:43:35 java.lang.AssertionError: The 
> TestingFatalErrorHandler caught an exception.
> 2022-02-11T21:43:35.4940444Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.util.TestingFatalErrorHandlerResource.after(TestingFatalErrorHandlerResource.java:81)
> 2022-02-11T21:43:35.4941937Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.util.TestingFatalErrorHandlerResource.access$300(TestingFatalErrorHandlerResource.java:36)
> 2022-02-11T21:43:35.4943249Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.util.TestingFatalErrorHandlerResource$1.evaluate(TestingFatalErrorHandlerResource.java:60)
> 2022-02-11T21:43:35.4944745Z Feb 11 21:43:35  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2022-02-11T21:43:35.4945682Z Feb 11 21:43:35  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> 2022-02-11T21:43:35.4946655Z Feb 11 21:43:35  at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> 2022-02-11T21:43:35.4947847Z Feb 11 21:43:35  at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> 2022-02-11T21:43:35.4948876Z Feb 11 21:43:35  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> 2022-02-11T21:43:35.4949842Z Feb 11 21:43:35  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> 2022-02-11T21:43:35.4951142Z Feb 11 21:43:35  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> 2022-02-11T21:43:35.4952153Z Feb 11 21:43:35  at 
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> 2022-02-11T21:43:35.4953115Z Feb 11 21:43:35  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> 2022-02-11T21:43:35.4954068Z Feb 11 21:43:35  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> 2022-02-11T21:43:35.4955003Z Feb 11 21:43:35  at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> 2022-02-11T21:43:35.4955981Z Feb 11 21:43:35  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> 2022-02-11T21:43:35.4956930Z Feb 11 21:43:35  at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> 2022-02-11T21:43:35.4958008Z Feb 11 21:43:35  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> 2022-02-11T21:43:35.4958899Z Feb 11 21:43:35  at 
> org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> 2022-02-11T21:43:35.4959774Z Feb 11 21:43:35  at 
> org.junit.runner.JUnitCore.run(JUnitCore.java:115)
> 2022-02-11T21:43:35.4960911Z Feb 11 21:43:35  at 
> org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
> 2022-02-11T21:43:35.4962095Z Feb 11 21:43:35  at 
> org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
> 2022-02-11T21:43:35.4963136Z Feb 11 21:43:35  at 
> org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72)
> 2022-02-11T21:43:35.4964275Z Feb 11 21:43:35  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:107)
> 2022-02-11T21:43:35.4965527Z Feb 11 21:43:35  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
> 2022-02-11T21:43:35.4966787Z Feb 11 21:43:35  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54)
> 2022-02-11T21:43:35.4968228Z Feb 11 21:43:35  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:67)
> 2022-02-11T21:43:35.4969485Z Feb 11 21:43:35  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:52)
> 2022-02-11T21:43:35.4970753Z Feb 11 21:43:35  at 
> org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
> 2022-02-11T21:43:35.4971842Z Feb 11 21:43:35  at 
> org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
> 2022-02-11T21:43:35.4973291Z Feb 11 21:43:35  at 
> org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
> 2022-02-11T21:43:35.4974538Z Feb 11 21:43:35  at 
> org.junit.platform.launcher.core.SessionPerRequestLauncher.execute(SessionPerRequestLauncher.java:53)
> 2022-02-11T21:43:35.4975737Z Feb 11 21:43:35  at 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.execute(JUnitPlatformProvider.java:188)
> 2022-02-11T21:43:35.4976936Z Feb 11 21:43:35  at 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:154)
> 2022-02-11T21:43:35.4978291Z Feb 11 21:43:35  at 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:124)
> 2022-02-11T21:43:35.4979588Z Feb 11 21:43:35  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
> 2022-02-11T21:43:35.4980728Z Feb 11 21:43:35  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
> 2022-02-11T21:43:35.4981753Z Feb 11 21:43:35  at 
> org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
> 2022-02-11T21:43:35.4982863Z Feb 11 21:43:35  at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)
> 2022-02-11T21:43:35.4983958Z Feb 11 21:43:35 Caused by: 
> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not 
> handle node changed event.
> 2022-02-11T21:43:35.4985193Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.retrieveLeaderInformationFromZooKeeper(ZooKeeperLeaderRetrievalDriver.java:143)
> 2022-02-11T21:43:35.4986451Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.onReconnectedConnectionState(ZooKeeperLeaderRetrievalDriver.java:181)
> 2022-02-11T21:43:35.4987902Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.handleStateChange(ZooKeeperLeaderRetrievalDriver.java:164)
> 2022-02-11T21:43:35.4989235Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.lambda$new$0(ZooKeeperLeaderRetrievalDriver.java:61)
> 2022-02-11T21:43:35.4990632Z Feb 11 21:43:35  at 
> org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:279)
> 2022-02-11T21:43:35.4991926Z Feb 11 21:43:35  at 
> org.apache.flink.shaded.curator5.org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
> 2022-02-11T21:43:35.4993137Z Feb 11 21:43:35  at 
> org.apache.flink.shaded.curator5.org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89)
> 2022-02-11T21:43:35.4994438Z Feb 11 21:43:35  at 
> org.apache.flink.shaded.curator5.org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89)
> 2022-02-11T21:43:35.4995807Z Feb 11 21:43:35  at 
> org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:279)
> 2022-02-11T21:43:35.4997130Z Feb 11 21:43:35  at 
> org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
> 2022-02-11T21:43:35.4999203Z Feb 11 21:43:35  at 
> org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:132)
> 2022-02-11T21:43:35.4999927Z Feb 11 21:43:35  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2022-02-11T21:43:35.5000679Z Feb 11 21:43:35  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2022-02-11T21:43:35.5001306Z Feb 11 21:43:35  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2022-02-11T21:43:35.5001866Z Feb 11 21:43:35  at 
> java.lang.Thread.run(Thread.java:748)
> 2022-02-11T21:43:35.5002381Z Feb 11 21:43:35 Caused by: 
> java.lang.IllegalStateException: java.lang.InterruptedException
> 2022-02-11T21:43:35.5003374Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalConnectionHandlingTest$QueueLeaderElectionListener.notifyLeaderAddress(ZooKeeperLeaderRetrievalConnectionHandlingTest.java:375)
> 2022-02-11T21:43:35.5004370Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.retrieveLeaderInformationFromZooKeeper(ZooKeeperLeaderRetrievalDriver.java:136)
> 2022-02-11T21:43:35.5005005Z Feb 11 21:43:35  ... 14 more
> 2022-02-11T21:43:35.5005464Z Feb 11 21:43:35 Caused by: 
> java.lang.InterruptedException
> 2022-02-11T21:43:35.5006079Z Feb 11 21:43:35  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> 2022-02-11T21:43:35.5006973Z Feb 11 21:43:35  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
> 2022-02-11T21:43:35.5007877Z Feb 11 21:43:35  at 
> java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
> 2022-02-11T21:43:35.5008787Z Feb 11 21:43:35  at 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalConnectionHandlingTest$QueueLeaderElectionListener.notifyLeaderAddress(ZooKeeperLeaderRetrievalConnectionHandlingTest.java:373)
> 2022-02-11T21:43:35.5009547Z Feb 11 21:43:35  ... 15 more
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=31294&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=7581



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to