[ https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597279#comment-17597279 ]
Matthias Pohl edited comment on FLINK-28078 at 8/29/22 4:24 PM:
----------------------------------------------------------------

CURATOR-645 is probably caused by the test revoking the leadership "too fast", which makes the Curator code end up in a different code path that contains a bug. There's already a fix for that in [a PR in the CURATOR project|https://github.com/apache/curator/pull/430], but I'm not sure how fast we're going to get this merged.

I started experimenting with a temporary (dirty) workaround for our test to make it less likely to fail. I suspect that we only need to "add some workload on the leader's side" to throttle the leadership revocation.

Still, this issue can happen in production as well. There's a race condition between the leader losing its leadership and the candidates re-evaluating the leadership on their end ({{LeaderLatch#getChildren}} is called before the leader's znode is deleted, but {{LeaderLatch#checkLeadership}} is called after the leader's znode is deleted). We can only overcome this by fixing CURATOR-645 and upgrading to the corresponding Apache Curator version.

I would still leave the Jira issue as {{Major}} because nobody has reported it so far. The test case only made the issue visible because the race condition became more likely in the test, where the leader process handles no actual workload.

was (Author: mapohl):
CURATOR-645 is probably caused by the test revoking the leadership "too fast", which makes the Curator code end up in a different code path that contains a bug. There's already a fix for that in [a PR in the CURATOR project|https://github.com/apache/curator/pull/430], but I'm not sure how fast we're going to get this merged.

I started experimenting with a temporary (dirty) workaround for our test to make it less likely to fail.
I suspect that we only need to "add some workload on the leader's side" to throttle the leadership revocation.

Still, this issue can happen in production as well. There's a race condition between the leader losing its leadership and the candidates re-evaluating the leadership on their end ({{LeaderLatch#getChildren}} is called before the leader's znode is deleted, but {{LeaderLatch#checkLeadership}} is called after the leader's znode is deleted).

I would still leave the Jira issue as {{Major}} because nobody has reported it so far. The test case only made the issue visible because the race condition became more likely in the test, where the leader process handles no actual workload.

> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-28078
>                 URL: https://issues.apache.org/jira/browse/FLINK-28078
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0, 1.15.2
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available, stale-assigned, test-stability
>
> [Build #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455] got stuck in {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0 tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10    java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10 	at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10 	- parking to wait for <0x00000000c2571b80> (a java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10 	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10 	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10 	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10 	at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10 	at org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 30 16:36:10 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10 	at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)