[jira] [Resolved] (CURATOR-87) new LeaderLatch "jitters" after network outage

Jordan Zimmerman (JIRA) Sat, 24 May 2014 08:40:31 -0700

     [ 
https://issues.apache.org/jira/browse/CURATOR-87?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jordan Zimmerman resolved CURATOR-87.
-------------------------------------

    Resolution: Not a Problem

I agree with Evaristo here. Also, please note that there has been other work 
on background stability that may mitigate the OP's issues.

> new LeaderLatch "jitters" after network outage
> ----------------------------------------------
>
>                 Key: CURATOR-87
>                 URL: https://issues.apache.org/jira/browse/CURATOR-87
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.2.0-incubating
>         Environment: OS-X
>            Reporter: Oliver Dain
>            Priority: Minor
>
> I have a LeaderLatch that has become the leader. Then all of ZooKeeper 
> becomes unreachable (due to network issues, for example). I know that I 
> could keep the same LeaderLatch instance and that it would renegotiate 
> leadership once ZK becomes reachable again. However, that doesn't work for 
> my particular use case, and I have to release the LeaderLatch. Later, when 
> ZK is available again, I allocate a new LeaderLatch instance and call 
> start() on it. The bug is that when await() is called on the new latch, it 
> immediately calls the isLeader callback, and then, almost immediately after 
> the await() call returns, notLeader gets called.
> The following unit test reproduces the problem:
>     @Test
>     public void leaderLatchJitters() throws Exception {
>         TestingServer server = new TestingServer();
>         CuratorFramework zkClient = CuratorFrameworkFactory.newClient(
>                 server.getConnectString(), new ExponentialBackoffRetry(1000, 3));
>         zkClient.start();
>         LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         final AtomicInteger numIsLeader = new AtomicInteger(0);
>         final AtomicInteger numNotLeader = new AtomicInteger(0);
>         LeaderLatchListener lll = new LeaderLatchListener() {
>             @Override
>             public void isLeader() {
>                 log.debug("isLeader called");
>                 numIsLeader.incrementAndGet();
>             }
>             @Override
>             public void notLeader() {
>                 log.debug("notLeader called");
>                 numNotLeader.incrementAndGet();
>             }
>         };
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         leaderLatch.start();
>         leaderLatch.await();
>         assertTrue(leaderLatch.hasLeadership());
>         assertEquals(1, numIsLeader.get());
>         assertEquals(0, numNotLeader.get());
>         // Shut down the server, wait for us to lose the lock, then restart
>         File zkTmpDir = server.getTempDirectory();
>         int zkServerPort = server.getPort();
>         server.stop();
>         while (leaderLatch.hasLeadership()) {
>             log.debug("Waiting for curator to notice it's not the leader");
>             Thread.sleep(100);
>         }
>         log.debug("Curator has noticed that it is no longer the leader");
>         assertEquals(1, numNotLeader.get());
>         assertEquals(1, numIsLeader.get());
>         leaderLatch.close();
>         // Restart ZooKeeper
>         server = new TestingServer(zkServerPort, zkTmpDir);
>         leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         log.debug("Calling leaderLatch.start()");
>         leaderLatch.start();
>         log.debug("Trying to regain leadership");
>         leaderLatch.await();
>         log.debug("We have regained leadership");
>         // Wait so we have time to observe the "jitter"
>         Thread.sleep(100);
>         assertTrue(leaderLatch.hasLeadership());
>         // Bug here. numIsLeader == 3
>         assertEquals(2, numIsLeader.get());
>         // Bug here too, numNotLeader == 2
>         assertEquals(1, numNotLeader.get());
>         log.debug("calling leaderLatch.close");
>         leaderLatch.close();
>     }
> The output from this is:
> Running com.threeci.commons.zkrecipes.TransactionalLockTest
> 0    [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 104  [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 132  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Curator has noticed that it is no longer the leader
> 171  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Calling leaderLatch.start()
> 172  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Trying to regain leadership
> 1882 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 1883 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - We have regained leadership
> 1883 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 1885 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 2084 [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.632 sec <<< FAILURE!
> java.lang.AssertionError: expected:<2> but was:<3>
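
To make the failing counts easier to follow, the callback order in the log output can be tallied by hand. The sketch below is only an illustration: CallbackTally and its members are hypothetical names, not part of Curator, and it assumes the failing assertions ran after the callback logged at 1885 ms and before the one logged at 2084 ms, which is what the 100 ms sleep in the test suggests.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Curator): tallies the listener callbacks
// in the order they appear in the reported log output.
class CallbackTally {
    // Callback names copied from the log, with their timestamps and threads.
    static final List<String> LOGGED = Arrays.asList(
        "isLeader",   // 0    [main-EventThread]
        "notLeader",  // 104  [ConnectionStateManager-0]
        "isLeader",   // 1882 [main-EventThread]
        "notLeader",  // 1883 [main-EventThread]
        "isLeader",   // 1885 [main-EventThread]
        "notLeader"   // 2084 [ConnectionStateManager-0]
    );

    // Counts how many times the named callback occurs in the given events.
    static long count(List<String> events, String name) {
        return events.stream().filter(name::equals).count();
    }

    public static void main(String[] args) {
        // Assumption: only the first five callbacks had fired when the
        // assertions ran (await() returned at 1883 ms, then a 100 ms sleep).
        List<String> seen = LOGGED.subList(0, 5);
        System.out.println("numIsLeader = " + count(seen, "isLeader"));   // 3
        System.out.println("numNotLeader = " + count(seen, "notLeader")); // 2
    }
}
```

Under that assumption the tally gives numIsLeader == 3 and numNotLeader == 2, matching the comments in the test and the reported "expected:<2> but was:<3>" failure.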



--
This message was sent by Atlassian JIRA
(v6.2#6252)
