[
https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558357#comment-17558357
]
Matthias Pohl commented on FLINK-28078:
---------------------------------------
The loop consists of the following logs:
{code}
16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
{code}
# {{reset()}} triggers
[getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L629]
through
[LeaderLatch#getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L525]
after a new child is created (I would expect the {{create2}} entry to appear
in the logs before the {{getChildren}} entry, which is not the case, so I
might be wrong in my observation).
# The callback of {{getChildren}} triggers
[checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625].
# In the meantime, the predecessor gets deleted (I'd assume because of the
deterministic ordering of the events in ZK). This causes the [callback in
checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L607]
to fail with a {{NONODE}} event, which triggers a reset of the current
{{LeaderLatch}} instance. The reset in turn deletes the current
{{LeaderLatch}}'s child zNode; that deletion is executed on the server later on.
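To make the cycle described above easier to follow, here is a toy, single-threaded model of it in Java. It is only a sketch under assumptions, not Curator's implementation: {{LatchLoopSketch}}, {{Latch}}, {{simulate}} and the two latches A and B are hypothetical names, the ZooKeeper server is reduced to a FIFO request queue, and the initial trigger (A resetting while B's {{getData}} on A's node is still queued) is assumed purely for illustration.
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeSet;

/** Toy model of the reset loop. Hypothetical names throughout; not Curator code. */
public class LatchLoopSketch {
    // Sequence numbers of the child nodes currently present under the latch path.
    static final TreeSet<Integer> children = new TreeSet<>();
    // Pending requests; the "server" handles them strictly in FIFO order.
    static final Deque<Runnable> queue = new ArrayDeque<>();
    // One entry per handled request, e.g. "A:delete".
    static final List<String> trace = new ArrayList<>();
    static int nextSeq = 0;

    static class Latch {
        final String name;
        Integer ownNode;

        Latch(String name) { this.name = name; }

        // Models reset(): queue the deletion of the own child, then the creation of a new one.
        void reset() {
            final Integer old = ownNode;
            ownNode = null;
            if (old != null) {
                queue.add(() -> { children.remove(old); trace.add(name + ":delete"); });
            }
            queue.add(() -> {
                ownNode = nextSeq++;
                children.add(ownNode);
                trace.add(name + ":create");
                queue.add(this::getChildren); // the create callback asks for the children
            });
        }

        // Models getChildren plus its callback (checkLeadership): watch the predecessor.
        void getChildren() {
            trace.add(name + ":getChildren");
            Integer predecessor = children.lower(ownNode);
            if (predecessor == null) { trace.add(name + ":leader"); return; }
            queue.add(() -> getData(predecessor));
        }

        // Models the getData callback: a missing node (NONODE) triggers another reset.
        void getData(int node) {
            trace.add(name + ":getData");
            if (!children.contains(node)) reset();
        }
    }

    public static List<String> simulate(int steps) {
        children.clear(); queue.clear(); trace.clear();
        Latch a = new Latch("A");
        Latch b = new Latch("B");
        a.ownNode = 0; b.ownNode = 1;
        children.add(0); children.add(1);
        nextSeq = 2;
        a.reset();                      // assumed trigger: A gives up its node ...
        queue.add(() -> b.getData(0));  // ... while B's request on that node is still queued
        for (int i = 0; i < steps && !queue.isEmpty(); i++) queue.poll().run();
        return new ArrayList<>(trace);
    }

    public static void main(String[] args) {
        simulate(12).forEach(System.out::println);
    }
}
{code}
In this model the handled request types repeat as delete, create, getData, getChildren, and neither latch ever sees itself as the lowest child. That is at least consistent with the order of request types in the log above, but it does not prove that this exact interleaving is what happens in the test.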
> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers
> runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
> Key: FLINK-28078
> URL: https://issues.apache.org/jira/browse/FLINK-28078
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.0
> Reporter: Matthias Pohl
> Assignee: Matthias Pohl
> Priority: Major
> Labels: test-stability
>
> [Build
> #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455]
> got stuck in
> {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0
> tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10 java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10 at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10 - parking to wait for <0x00000000c2571b80> (a
> java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10 at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10 at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10 at
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10 at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10 at
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10 at
> org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> May 30 16:36:10 at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10 at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10 at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
>