[
https://issues.apache.org/jira/browse/CURATOR-87?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jordan Zimmerman resolved CURATOR-87.
-------------------------------------
Resolution: Not a Problem
I agree with Evaristo here. Also, please note, there has been other work on
background stability, etc. that may mitigate the OP's issues.
> new LeaderLatch "jitters" after network outage
> ----------------------------------------------
>
> Key: CURATOR-87
> URL: https://issues.apache.org/jira/browse/CURATOR-87
> Project: Apache Curator
> Issue Type: Bug
> Components: Recipes
> Affects Versions: 2.2.0-incubating
> Environment: OS-X
> Reporter: Oliver Dain
> Priority: Minor
>
> I have a LeaderLatch that has become the leader. Then all of ZooKeeper
> becomes unreachable (due to network issues or something). I do know that I
> could maintain the same LeaderLatch instance and when ZK becomes reachable
> again it would re-negotiate leadership. However, for my particular use case
> this doesn't work and I have to release the LeaderLatch. Later, when ZK is
> available again I allocate a new LeaderLatch instance and call start() and on
> it. The bug is that this when await() is called on the new latch it
> immediately calls the isLeader callback and then almost immediately after the
> await() call returns, notLeader gets called.
> The following unit test reproduces the problem:
> @Test
> public void leaderLatchJitters() throws Exception {
> TestingServer server = new TestingServer();
> CuratorFramework zkClient =
> CuratorFrameworkFactory.newClient(server.getConnectString(),
> new ExponentialBackoffRetry(1000, 3));
> zkClient.start();
> LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
> final AtomicInteger numIsLeader = new AtomicInteger(0);
> final AtomicInteger numNotLeader = new AtomicInteger(0);
> LeaderLatchListener lll = new LeaderLatchListener() {
> @Override
> public void isLeader() {
> log.debug("isLeader called");
> numIsLeader.incrementAndGet();
> }
> @Override
> public void notLeader() {
> log.debug("notLeader called");
> numNotLeader.incrementAndGet();
> }
> };
> leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
> leaderLatch.start();
> leaderLatch.await();
> assertTrue(leaderLatch.hasLeadership());
> assertEquals(1, numIsLeader.get());
> assertEquals(0, numNotLeader.get());
> // Shut down the server, wait for us to lose the lock, then restart
> File zkTmpDir = server.getTempDirectory();
> int zkServerPort = server.getPort();
> server.stop();
> while (leaderLatch.hasLeadership()) {
> log.debug("Waiting for curator to notice it's not the leader");
> Thread.sleep(100);
> }
> log.debug("Curator has noticed that it is no longer the leader");
> assertEquals(1, numNotLeader.get());
> assertEquals(1, numIsLeader.get());
> leaderLatch.close();
> // Restart ZooKeeper
> server = new TestingServer(zkServerPort, zkTmpDir);
> leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
> leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
> log.debug("Calling leaderLatch.start()");
> leaderLatch.start();
> log.debug("Trying to regain leadership");
> leaderLatch.await();
> log.debug("We have regained leadership");
> // Wait so we have time to observe the "jitter"
> Thread.sleep(100);
> assertTrue(leaderLatch.hasLeadership());
> // Bug here. numIsLeader == 3
> assertEquals(2, numIsLeader.get());
> // Bug here too, numNotLeader == 2
> assertEquals(1, numNotLeader.get());
> log.debug("calling leaderLatch.close");
> leaderLatch.close();
> }
> The output from this is:
> Running com.threeci.commons.zkrecipes.TransactionalLockTest
> 0 [main-EventThread] DEBUG
> com.threeci.commons.zkrecipes.TransactionalLockTest - isLeader called
> 104 [ConnectionStateManager-0] DEBUG
> com.threeci.commons.zkrecipes.TransactionalLockTest - notLeader called
> 132 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest -
> Curator has noticed that it is no longer the leader
> 171 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest -
> Calling leaderLatch.start()
> 172 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest -
> Trying to regain leadership
> 1882 [main-EventThread] DEBUG
> com.threeci.commons.zkrecipes.TransactionalLockTest - isLeader called
> 1883 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest - We
> have regained leadership
> 1883 [main-EventThread] DEBUG
> com.threeci.commons.zkrecipes.TransactionalLockTest - notLeader called
> 1885 [main-EventThread] DEBUG
> com.threeci.commons.zkrecipes.TransactionalLockTest - isLeader called
> 2084 [ConnectionStateManager-0] DEBUG
> com.threeci.commons.zkrecipes.TransactionalLockTest - notLeader called
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.632 sec <<<
> FAILURE!
> java.lang.AssertionError: expected:<2> but was:<3>
--
This message was sent by Atlassian JIRA
(v6.2#6252)