[ 
https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597279#comment-17597279
 ] 

Matthias Pohl edited comment on FLINK-28078 at 8/29/22 4:24 PM:
----------------------------------------------------------------

CURATOR-645 is probably caused by the test revoking the leadership "too fast", 
which makes the Curator code end up in a different code path that contains a 
bug. There's already a fix for that in [a PR in the CURATOR 
project|https://github.com/apache/curator/pull/430], but I'm not sure how fast 
it's going to get merged.

I started experimenting with a temporary (dirty) workaround for our test to 
make it less likely to fail. I suspect we only need to "add some workload on 
the leader's side" to throttle the leadership revocation. Still, this issue 
can happen in production as well: there's a race condition between the leader 
losing its leadership and the candidates re-evaluating the leadership on their 
end ({{LeaderLatch#getChildren}} is called before the leader's znode is 
deleted, but {{LeaderLatch#checkLeadership}} is called after the leader's 
znode is deleted). We can only fully overcome this by fixing CURATOR-645 and 
upgrading to the corresponding Apache Curator version.
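The problematic interleaving can be illustrated with a toy simulation (plain Java, not actual Curator code; the class name, the helper {{isLeader}}, and the znode names are hypothetical). The candidate snapshots the latch path's children, the leader's znode is deleted in between, and the leadership check then runs against the stale snapshot:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the check-then-act race described above. In a LeaderLatch-style
// election, the participant whose znode has the lowest sequence number leads.
public class LeaderLatchRaceSketch {

    // A participant is leader iff its znode is first in the sorted child list.
    static boolean isLeader(List<String> children, String me) {
        return !children.isEmpty() && children.get(0).equals(me);
    }

    public static void main(String[] args) {
        // Children of the latch path, ordered by sequence number.
        List<String> znodes =
                new ArrayList<>(List.of("latch-0000000001", "latch-0000000002"));
        String candidate = "latch-0000000002";

        // Step 1: the candidate takes a snapshot of the children
        // (the analogue of LeaderLatch#getChildren).
        List<String> snapshot = new ArrayList<>(znodes);

        // Step 2: leadership is revoked "too fast" - the old leader's znode
        // is deleted before the candidate re-checks.
        znodes.remove("latch-0000000001");

        // Step 3: the leadership check (the analogue of
        // LeaderLatch#checkLeadership) runs against the stale snapshot:
        // the candidate still sees a predecessor and concludes it is not
        // the leader, although by now it is.
        System.out.println("stale view says leader:  " + isLeader(snapshot, candidate)); // false
        System.out.println("actual state says leader: " + isLeader(znodes, candidate));  // true
    }
}
```

With a real workload on the leader's side, step 2 is less likely to complete between steps 1 and 3, which is why the throttling workaround reduces the test's failure rate without eliminating the race.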

I would still leave the Jira issue at {{Major}} because nobody has reported it 
so far. The test case only made the issue visible because the race condition 
becomes more likely in the test, with no actual workload being processed by 
the leader process.



> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-28078
>                 URL: https://issues.apache.org/jira/browse/FLINK-28078
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0, 1.15.2
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available, stale-assigned, test-stability
>
> [Build 
> #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455]
>  got stuck in 
> {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0 
> tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10    java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10       at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10       - parking to wait for  <0x00000000c2571b80> (a 
> java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10       at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10       at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10       at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10       at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10       at 
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10       at 
> org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> May 30 16:36:10       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10       at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
