[ https://issues.apache.org/jira/browse/CURATOR-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Pohl updated CURATOR-645:
----------------------------------
    Description: 
We experienced a strange behavior of the {{LeaderLatch}} in a test case in Apache 
Flink (see FLINK-28078) where two {{LeaderLatch}} instances compete for the 
leadership, resulting in an infinite loop.

The test includes three instances of a wrapper class that has a {{LeaderLatch}} 
as a member. This is about 
[ZooKeeperMultipleComponentLeaderElectionDriverTest::testLeaderElectionWithMultipleDrivers|https://github.com/apache/flink/blob/7d85b273ccdbd5a2242e05e5d645ea82280f5eea/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperMultipleComponentLeaderElectionDriverTest.java#L236].
 In the test, the first {{LeaderLatch}} acquires the leadership, which results 
in that {{LeaderLatch}} being closed and, as a consequence, losing the 
leadership. The odd thing is that the two left-over {{LeaderLatch}} instances 
then end up in an infinite loop, as shown in the ZooKeeper server logs:
{code:java}
16:17:07,864 [        SyncThread:0] DEBUG 
org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing 
request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 
zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
16:17:07,864 [        SyncThread:0] DEBUG 
org.apache.zookeeper.server.FinalRequestProcessor            [] - 
sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe 
txntype:unknown reqpath:/flink/default/latch
16:17:07,866 [        SyncThread:0] DEBUG 
org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing 
request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 
reqpath:n/a
16:17:07,866 [        SyncThread:0] DEBUG 
org.apache.zookeeper.server.FinalRequestProcessor            [] - 
sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
16:17:07,869 [        SyncThread:0] DEBUG 
org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing 
request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd 
txntype:15 reqpath:n/a
16:17:07,869 [        SyncThread:0] DEBUG 
org.apache.zookeeper.server.FinalRequestProcessor            [] - 
sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 
reqpath:n/a
16:17:07,869 [        SyncThread:0] DEBUG 
org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing 
request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 
zxid:0xfffffffffffffffe txntype:unknown 
reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
{code}
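For reference, the competing latches in the test are created roughly as follows. This 
is a minimal, self-contained sketch rather than the actual Flink test code; the 
connect string, retry policy and polling loop are placeholders, only the latch path 
is taken from the logs above:
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CompetingLatchesSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; not taken from the Flink test.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Latch path as seen in the ZooKeeper server logs above.
        String latchPath = "/flink/default/latch";
        LeaderLatch[] latches = {
                new LeaderLatch(client, latchPath),
                new LeaderLatch(client, latchPath),
                new LeaderLatch(client, latchPath)
        };
        for (LeaderLatch latch : latches) {
            latch.start();
        }

        // Wait until one latch has won leadership, then close it (as the wrapper
        // in the Flink test does). The two remaining latches are expected to
        // elect a new leader, but instead end up in the reset() cycle analysed below.
        LeaderLatch firstLeader = null;
        while (firstLeader == null) {
            for (LeaderLatch latch : latches) {
                if (latch.hasLeadership()) {
                    firstLeader = latch;
                    break;
                }
            }
            Thread.sleep(10);
        }
        firstLeader.close();
    }
}
{code}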
It looks like the close call of the {{LeaderLatch}} that held the initial 
leadership (which deletes the corresponding ZNode) races with the watcher, which 
then triggers {{reset()}} for the left-over {{LeaderLatch}} instances instead of 
simply retrieving the left-over children:
 # The {{reset()}} triggers 
[getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L629]
 through the 
[LeaderLatch#getChildren|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L525]
 after a new child is created (I would expect the {{create2}} entry to appear in 
the logs before the {{getChildren}} entry, which is not the case; so I might be 
wrong in this observation).
 # The callback of {{getChildren}} triggers 
[checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625].
 # In the meantime, the predecessor gets deleted (I'd assume because of the 
deterministic ordering of the events in ZK). This causes the [callback in 
checkLeadership|https://github.com/apache/curator/blob/d1a9234ecae47e3704037c839e6041931c24d1f4/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L607]
 to fail with a {{NONODE}} event, triggering the reset of the current 
{{LeaderLatch}} instance, which again triggers the deletion of the current 
{{LeaderLatch}}'s child ZNode; that deletion is executed on the server later on 
(a toy model of this cycle is sketched below).
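
To make the suspected cycle easier to follow, here is a toy, single-threaded model 
of it. This is not Curator code: node names, sequence numbers and the forced 
interleaving are only meant to illustrate how each {{reset()}} deletes exactly the 
node the other latch is about to call {{getData}} on, so both keep getting 
{{NONODE}} and resetting forever:
{code:java}
import java.util.TreeMap;

// Toy model of the suspected cycle. The TreeMap stands in for the children of
// /flink/default/latch; "pendingGetDataTarget" models the asynchronous getData
// call issued against a predecessor taken from an already stale children snapshot.
public class LatchResetCycleSketch {

    static final TreeMap<Integer, String> nodes = new TreeMap<>(); // seq -> owner
    static int nextSeq = 3; // seq 0 belonged to the closed leader, 1 and 2 to the leftovers

    static class ToyLatch {
        final String name;
        int mySeq = -1;
        Integer pendingGetDataTarget = null; // predecessor from a stale snapshot

        ToyLatch(String name) { this.name = name; }

        // reset(): delete the own node, create a fresh sequential node, then
        // check leadership against the children as seen at this moment.
        void reset() {
            if (mySeq >= 0) nodes.remove(mySeq);
            mySeq = nextSeq++;
            nodes.put(mySeq, name);
            Integer predecessor = nodes.lowerKey(mySeq); // children snapshot
            if (predecessor == null) {
                System.out.println(name + " becomes leader with seq " + mySeq);
            } else {
                // Schedule getData on the predecessor. By the time the call is
                // processed, the other latch may already have deleted that node.
                pendingGetDataTarget = predecessor;
                System.out.println(name + " created seq " + mySeq
                        + ", will getData on seq " + predecessor);
            }
        }

        // The getData callback: a NONODE result on the predecessor triggers reset().
        void runPendingGetData() {
            if (pendingGetDataTarget == null) return;
            int target = pendingGetDataTarget;
            pendingGetDataTarget = null;
            if (nodes.containsKey(target)) {
                System.out.println(name + " watches seq " + target);
            } else {
                System.out.println(name + ": getData on seq " + target + " -> NONODE -> reset()");
                reset();
            }
        }
    }

    public static void main(String[] args) {
        ToyLatch a = new ToyLatch("latch-A");
        ToyLatch b = new ToyLatch("latch-B");
        // Starting point: the previous leader (seq 0) has just been closed and its
        // node deleted; the two leftover latches still hold seq 1 and seq 2.
        a.mySeq = 1; nodes.put(1, a.name);
        b.mySeq = 2; nodes.put(2, b.name);
        // Per the suspicion above, the watcher reaction is a full reset() on both
        // leftover latches instead of a plain getChildren:
        a.reset();
        b.reset();
        // Alternate the delayed getData callbacks for a few rounds. Each NONODE
        // result causes a reset() that deletes exactly the node the other latch is
        // about to getData on, so neither latch ever becomes leader.
        for (int round = 0; round < 3; round++) {
            a.runPendingGetData();
            b.runPendingGetData();
        }
    }
}
{code}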

  was:We experienced a strange behavior of the LeaderLatch in a test case in 
Apache Flink (see FLINK-28078) where two {{LeaderLatch}} instances are 
competing for the leadership resulting in an infinite loop.


> LeaderLatch generates infinite loop with two LeaderLatch instances competing 
> for the leadership
> -----------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-645
>                 URL: https://issues.apache.org/jira/browse/CURATOR-645
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 5.2.0
>            Reporter: Matthias Pohl
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
