[ https://issues.apache.org/jira/browse/CURATOR-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007745#comment-14007745 ]

Evaristo commented on CURATOR-87:
---------------------------------

Hi there,

In my opinion this test is not following the API contract.

A LeaderLatch does not need to be closed when the connection to ZK is lost. 
You can use a ConnectionListener to track those errors, and once you are 
reconnected the same LeaderLatch can be used.
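
For reference, a minimal sketch of that approach (not compilable on its own: 
it assumes the zkClient variable from the tests below and Curator's 
ConnectionStateListener/ConnectionState API on the classpath):

```java
// Sketch: watch connection state on the client instead of closing the latch.
zkClient.getConnectionStateListenable().addListener(new ConnectionStateListener() {
    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
            System.out.println("Connection trouble: " + newState);
        } else if (newState == ConnectionState.RECONNECTED) {
            // The existing LeaderLatch renegotiates leadership by itself;
            // no close()/re-create cycle is needed.
            System.out.println("Reconnected; the latch will recover");
        }
    }
});
```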

I created a test similar to yours without closing the latch, and the test 
always passes:

    @Test
    public void leaderLatchJittersBis() throws Exception {
        TestingServer server = new TestingServer();
        CuratorFramework zkClient = CuratorFrameworkFactory.newClient(server.getConnectString(),
                new ExponentialBackoffRetry(1000, 3));
        zkClient.start();
        LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
        final AtomicInteger numIsLeader = new AtomicInteger(0);
        final AtomicInteger numNotLeader = new AtomicInteger(0);
        LeaderLatchListener lll = new LeaderLatchListener() {
            @Override
            public void isLeader() {
                System.out.println("isLeader called");
                numIsLeader.incrementAndGet();
            }
            @Override
            public void notLeader() {
                System.out.println("notLeader called");
                numNotLeader.incrementAndGet();
            }
        };
        leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
        leaderLatch.start();
        leaderLatch.await();
        assertTrue(leaderLatch.hasLeadership());
        assertEquals(1, numIsLeader.get());
        assertEquals(0, numNotLeader.get());
        // Shut down the server, wait for us to lose the lock, then restart
        File zkTmpDir = server.getTempDirectory();
        int zkServerPort = server.getPort();
        server.stop();
        while (leaderLatch.hasLeadership()) {
            System.out.println("Waiting for curator to notice it's not the leader");
            Thread.sleep(100);
        }
        System.out.println("Curator has noticed that it is no longer the leader");
        assertEquals(1, numNotLeader.get());
        assertEquals(1, numIsLeader.get());
        // Restart ZooKeeper
        server = new TestingServer(zkServerPort, zkTmpDir);

        System.out.println("Trying to regain leadership");
        leaderLatch.await();
        System.out.println("We have regained leadership");
        // Wait so we have time to observe any "jitter"
        Thread.sleep(100);
        assertTrue(leaderLatch.hasLeadership());
        assertEquals(2, numIsLeader.get());
        assertEquals(1, numNotLeader.get());
        System.out.println("calling leaderLatch.close");
        leaderLatch.close();
    }


On the other hand, your original test always passes for me if I add a 
Thread.sleep(100) after close():

                leaderLatch.close();
                Thread.sleep(100);
                // Restart ZooKeeper
                server = new TestingServer(zkServerPort, zkTmpDir);


My feeling is that your test is creating some kind of "artificial" race 
condition. The close() method of LeaderLatch triggers background operations, 
so you should treat it as an asynchronous operation.
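
To illustrate why the Thread.sleep(100) helps: the snippet below is a 
self-contained, hypothetical stand-in (none of these class or field names are 
from Curator) for a close() that merely schedules its cleanup on a background 
thread. Immediately after close() returns, the cleanup has not happened yet; 
only after waiting does the caller observe it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class AsyncCloseRace {

    // Hypothetical stand-in for a latch whose close() only *schedules*
    // its cleanup on a background thread.
    static class BackgroundCloser {
        final AtomicBoolean cleanedUp = new AtomicBoolean(false);
        final CountDownLatch mayCleanUp = new CountDownLatch(1);
        final CountDownLatch cleanupDone = new CountDownLatch(1);
        final ExecutorService pool = Executors.newSingleThreadExecutor();

        void close() {
            // Returns immediately; the real work happens asynchronously.
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        mayCleanUp.await(); // simulate background delay
                    } catch (InterruptedException ignored) {
                    }
                    cleanedUp.set(true);
                    cleanupDone.countDown();
                }
            });
        }
    }

    public static void main(String[] args) throws Exception {
        BackgroundCloser latch = new BackgroundCloser();
        latch.close();
        // close() has returned, but the background cleanup has not run yet.
        System.out.println("cleaned up right after close(): " + latch.cleanedUp.get());
        latch.mayCleanUp.countDown();
        latch.cleanupDone.await(5, TimeUnit.SECONDS);
        // Only after waiting does the caller see the cleanup take effect.
        System.out.println("cleaned up after waiting: " + latch.cleanedUp.get());
        latch.pool.shutdown();
    }
}
```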

What do you think, Jordan? I can take a deeper look if you think this is 
really a bug.


> new LeaderLatch "jitters" after network outage
> ----------------------------------------------
>
>                 Key: CURATOR-87
>                 URL: https://issues.apache.org/jira/browse/CURATOR-87
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.2.0-incubating
>         Environment: OS-X
>            Reporter: Oliver Dain
>            Priority: Minor
>
> I have a LeaderLatch that has become the leader. Then all of ZooKeeper 
> becomes unreachable (due to network issues or something). I do know that I 
> could maintain the same LeaderLatch instance and when ZK becomes reachable 
> again it would re-negotiate leadership. However, for my particular use case 
> this doesn't work and I have to release the LeaderLatch. Later, when ZK is 
> available again I allocate a new LeaderLatch instance and call start() on 
> it. The bug is that when await() is called on the new latch it 
> immediately calls the isLeader callback and then almost immediately after the 
> await() call returns, notLeader gets called.
> The following unit test reproduces the problem:
>     @Test
>     public void leaderLatchJitters() throws Exception {
>         TestingServer server = new TestingServer();
>         CuratorFramework zkClient = CuratorFrameworkFactory.newClient(server.getConnectString(),
>                 new ExponentialBackoffRetry(1000, 3));
>         zkClient.start();
>         LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         final AtomicInteger numIsLeader = new AtomicInteger(0);
>         final AtomicInteger numNotLeader = new AtomicInteger(0);
>         LeaderLatchListener lll = new LeaderLatchListener() {
>             @Override
>             public void isLeader() {
>                 log.debug("isLeader called");
>                 numIsLeader.incrementAndGet();
>             }
>             @Override
>             public void notLeader() {
>                 log.debug("notLeader called");
>                 numNotLeader.incrementAndGet();
>             }
>         };
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         leaderLatch.start();
>         leaderLatch.await();
>         assertTrue(leaderLatch.hasLeadership());
>         assertEquals(1, numIsLeader.get());
>         assertEquals(0, numNotLeader.get());
>         // Shut down the server, wait for us to lose the lock, then restart
>         File zkTmpDir = server.getTempDirectory();
>         int zkServerPort = server.getPort();
>         server.stop();
>         while (leaderLatch.hasLeadership()) {
>             log.debug("Waiting for curator to notice it's not the leader");
>             Thread.sleep(100);
>         }
>         log.debug("Curator has noticed that it is no longer the leader");
>         assertEquals(1, numNotLeader.get());
>         assertEquals(1, numIsLeader.get());
>         leaderLatch.close();
>         // Restart ZooKeeper
>         server = new TestingServer(zkServerPort, zkTmpDir);
>         leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         log.debug("Calling leaderLatch.start()");
>         leaderLatch.start();
>         log.debug("Trying to regain leadership");
>         leaderLatch.await();
>         log.debug("We have regained leadership");
>         // Wait so we have time to observe the "jitter"
>         Thread.sleep(100);
>         assertTrue(leaderLatch.hasLeadership());
>         // Bug here. numIsLeader == 3
>         assertEquals(2, numIsLeader.get());
>         // Bug here too, numNotLeader == 2
>         assertEquals(1, numNotLeader.get());
>         log.debug("calling leaderLatch.close");
>         leaderLatch.close();
>     }
> The output from this is:
> Running com.threeci.commons.zkrecipes.TransactionalLockTest
> 0    [main-EventThread] DEBUG 
> com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 104  [ConnectionStateManager-0] DEBUG 
> com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 132  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - 
> Curator has noticed that it is no longer the leader
> 171  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - 
> Calling leaderLatch.start()
> 172  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - 
> Trying to regain leadership
> 1882 [main-EventThread] DEBUG 
> com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 1883 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - We 
> have regained leadership
> 1883 [main-EventThread] DEBUG 
> com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 1885 [main-EventThread] DEBUG 
> com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 2084 [ConnectionStateManager-0] DEBUG 
> com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.632 sec <<< 
> FAILURE!
> java.lang.AssertionError: expected:<2> but was:<3>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
