[
https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597279#comment-17597279
]
Matthias Pohl edited comment on FLINK-28078 at 8/29/22 4:24 PM:
----------------------------------------------------------------
CURATOR-645 is probably triggered by the test revoking the leadership "too
fast", which makes the Curator code end up in a different code path that
contains a bug. There's already a fix for that in [a PR in the
CURATOR project|https://github.com/apache/curator/pull/430]. But I'm not sure
how fast we're going to get this merged.
I started experimenting with a temporary (dirty) workaround for our test to
make it less likely to fail. I suspect that we only need to "add some
workload on the leader's side" to throttle the leadership revocation. Still,
this issue can happen in production as well. There's a race condition between
the leader losing its leadership and the candidates re-evaluating the
leadership on their end ({{LeaderLatch#getChildren}} is called before
the leader's znode is deleted but {{LeaderLatch#checkLeadership}} is called
after the leader's znode is deleted). We can only overcome this by fixing
CURATOR-645 and upgrading to the corresponding Apache Curator version.
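The interleaving described above can be sketched without Curator or ZooKeeper.
The following is only an illustration under assumptions, not Curator's actual
implementation: znodes are modelled as a sorted in-memory set, and the class
name, method names and znode names are all made up. It further assumes that a
non-leading candidate watches the znode directly preceding its own in the
sorted list of children.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical model of the race; none of these names exist in Curator.
class StaleSnapshotRace {

    // Given a (possibly stale) snapshot of the children, returns the znode the
    // candidate would watch: its direct predecessor in sorted order. An empty
    // result means the candidate is first in line and would become leader.
    static Optional<String> nodeToWatch(List<String> snapshot, String self) {
        List<String> sorted = new ArrayList<>(snapshot);
        Collections.sort(sorted);
        int index = sorted.indexOf(self);
        return index > 0 ? Optional.of(sorted.get(index - 1)) : Optional.empty();
    }

    // Replays the interleaving and reports whether the znode picked from the
    // stale snapshot still exists when the candidate evaluates the leadership.
    static boolean staleWatchTargetExists() {
        String leader = "latch-0000000000";
        String candidate = "latch-0000000001";
        TreeSet<String> znodes = new TreeSet<>(Arrays.asList(leader, candidate));

        // 1. The candidate lists the children while the leader's znode exists.
        List<String> snapshot = new ArrayList<>(znodes);

        // 2. The leader loses its leadership and its znode is deleted.
        znodes.remove(leader);

        // 3. The candidate evaluates the leadership based on the stale snapshot.
        Optional<String> watched = nodeToWatch(snapshot, candidate);
        return watched.isPresent() && znodes.contains(watched.get());
    }

    public static void main(String[] args) {
        System.out.println("watched znode still exists: " + staleWatchTargetExists());
    }
}
```

In this model the candidate ends up watching a znode that is already gone. How
the latch reacts to that situation is the part covered by CURATOR-645 and is
not modelled here.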
I would still leave the Jira issue as {{Major}} because nobody has reported
it so far. The test case only made the issue visible because the race
condition became more likely in the test, where no actual workload is
processed by the leader process.
> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers
> runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
> Key: FLINK-28078
> URL: https://issues.apache.org/jira/browse/FLINK-28078
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.0, 1.15.2
> Reporter: Matthias Pohl
> Assignee: Matthias Pohl
> Priority: Major
> Labels: pull-request-available, stale-assigned, test-stability
>
> [Build
> #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455]
> got stuck in
> {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0
> tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10 java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10 at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10 - parking to wait for <0x00000000c2571b80> (a
> java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10 at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10 at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10 at
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10 at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10 at
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10 at
> org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> May 30 16:36:10 at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10 at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10 at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
>