[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

ASF GitHub Bot (JIRA) Fri, 16 Feb 2018 14:42:05 -0800

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367960#comment-16367960 ]


ASF GitHub Bot commented on ZOOKEEPER-2845:
-------------------------------------------

Github user afine commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/453#discussion_r168887935
  
    --- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
    @@ -888,4 +923,103 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                     maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
         }
     
    +    @Test
    +    public void testFailedTxnAsPartOfQuorumLoss() throws Exception {
    +        // 1. start up server and wait for leader election to finish
    +        ClientBase.setupTestEnv();
    +        final int SERVER_COUNT = 3;
    +        servers = LaunchServers(SERVER_COUNT);
    +
    +        waitForAll(servers, States.CONNECTED);
    +
    +        // we need to shutdown and start back up to make sure that the create session isn't the first transaction since
    +        // that is rather innocuous.
    +        servers.shutDownAllServers();
    +        waitForAll(servers, States.CONNECTING);
    +        servers.restartAllServersAndClients(this);
    +        waitForAll(servers, States.CONNECTED);
    +
    +        // 2. kill all followers
    +        int leader = servers.findLeader();
    +        Map<Long, Proposal> outstanding = servers.mt[leader].main.quorumPeer.leader.outstandingProposals;
    +        // increase the tick time to delay the leader going to looking
    +        servers.mt[leader].main.quorumPeer.tickTime = 10000;
    +        LOG.warn("LEADER {}", leader);
    +
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                servers.mt[i].shutdown();
    +            }
    +        }
    +
    +        // 3. start up the followers to form a new quorum
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                servers.mt[i].start();
    +            }
    +        }
    +
    +        // 4. wait one of the follower to be the new leader
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                // Recreate a client session since the previous session was not persisted.
    +                servers.restartClient(i, this);
    +                waitForOne(servers.zk[i], States.CONNECTED);
    +            }
    +        }
    +
    +        // 5. send a create request to old leader and make sure it's synced to disk,
    +        //    which means it acked from itself
    +        try {
    +            servers.zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
    +                CreateMode.PERSISTENT);
    +            Assert.fail("create /zk" + leader + " should have failed");
    +        } catch (KeeperException e) {
    +        }
    +
    +        // just make sure that we actually did get it in process at the
    +        // leader
    +        Assert.assertEquals(1, outstanding.size());
    +        Proposal p = outstanding.values().iterator().next();
    +        Assert.assertEquals(OpCode.create, p.request.getHdr().getType());
    +
    +        // make sure it has a chance to write it to disk
    +        int sleepTime = 0;
    +        Long longLeader = new Long(leader);
    +        while (!p.qvAcksetPairs.get(0).getAckset().contains(longLeader)) {
    +            if (sleepTime > 2000) {
    +                Assert.fail("Transaction not synced to disk within 1 second " + p.qvAcksetPairs.get(0).getAckset()
    +                    + " expected " + leader);
    +            }
    +            Thread.sleep(100);
    +            sleepTime += 100;
    +        }
    +
    +        // 6. wait for the leader to quit due to not enough followers and come back up as a part of the new quorum
    +        sleepTime = 0;
    +        Follower f = servers.mt[leader].main.quorumPeer.follower;
    +        while (f == null || !f.isRunning()) {
    +            if (sleepTime > 10_000) {
    +                Assert.fail("Took too long for old leader to time out " + servers.mt[leader].main.quorumPeer.getPeerState());
    +            }
    +            Thread.sleep(100);
    +            sleepTime += 100;
    +            f = servers.mt[leader].main.quorumPeer.follower;
    +        }
    +        servers.mt[leader].shutdown();
    --- End diff --
    
    Why do we need this shutdown() call on the old leader?


> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time
> during leader election. In a ZooKeeper ensemble, it's possible that the
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or
> that the txn file is ahead of the snapshot because no commit message has been
> received yet.
> If the snapshot is ahead of the txn file, the SyncRequestProcessor queue is
> drained during shutdown, so the snapshot and txn file are consistent again
> before leader election happens, and this is not an issue.
> But if the txn file is ahead of the snapshot, the ensemble can end up with
> inconsistent data. Here is a simplified scenario that shows the issue:
> Say we have 3 servers in the ensemble: servers A and B are followers,
> C is the leader, and all snapshots and txn files are up to T0:
> 1. A new request to create Node N reaches leader C and is converted to
> txn T1.
> 2. Txn T1 is synced to disk on C, but just before the proposal reaches
> the followers, A and B restart, so T1 does not exist on A and B.
> 3. A and B form a new quorum after the restart; let's say B is the leader.
> 4. C changes to the LOOKING state because it does not have enough followers.
> It syncs with leader B using last Zxid T0, which results in an empty diff sync.
> 5. Before C takes a snapshot, it restarts and replays the txns on disk, which
> include T1. Now C has Node N, but A and B don't.
> I have also included a test case that reproduces this issue consistently.
> We have a totally different RetainDB version which avoids this issue by
> doing consensus between snapshot and txn files before leader election; we will
> submit it for review.
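
The five steps in the description can be sketched as a small in-memory model. This is an illustration only, not ZooKeeper code. The class RetainDbDivergenceSketch, its Server and Txn types, the /base node standing in for the shared state at T0, and the zxid values 1 and 2 are all assumptions made for the example. The model keeps an on-disk txn log that survives a restart, and an in-memory database that is retained when a server goes back to leader election.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class RetainDbDivergenceSketch {

    /** Hypothetical transaction: a zxid plus the path of the node it creates. */
    static final class Txn {
        final long zxid;
        final String path;
        Txn(long zxid, String path) { this.zxid = zxid; this.path = path; }
    }

    /** Hypothetical server: a disk log that survives restarts and a retained in-memory DB. */
    static final class Server {
        final String name;
        final List<Txn> diskLog = new ArrayList<>();
        Set<String> memoryDb = new TreeSet<>();
        long lastProcessedZxid = 0; // the zxid this server reports when it syncs

        Server(String name) { this.name = name; }

        void logToDisk(Txn t) { diskLog.add(t); }

        void commit(Txn t) { memoryDb.add(t.path); lastProcessedZxid = t.zxid; }

        /** Restart: drop the in-memory state and replay every txn found on disk. */
        void restart() {
            memoryDb = new TreeSet<>();
            lastProcessedZxid = 0;
            for (Txn t : diskLog) commit(t);
        }

        /** Diff sync: take only leader txns newer than the zxid we report. */
        int syncFrom(Server leader) {
            int applied = 0;
            for (Txn t : leader.diskLog) {
                if (t.zxid > lastProcessedZxid) { logToDisk(t); commit(t); applied++; }
            }
            return applied;
        }
    }

    static List<Server> runScenario() {
        Server a = new Server("A"), b = new Server("B"), c = new Server("C");
        Txn t0 = new Txn(1, "/base"); // assumed stand-in for "everything up to T0"
        for (Server s : Arrays.asList(a, b, c)) { s.logToDisk(t0); s.commit(t0); }

        // Steps 1-2: C logs T1 to disk but never commits it; A and B restart without it.
        Txn t1 = new Txn(2, "/N");
        c.logToDisk(t1);
        a.restart();
        b.restart();

        // Steps 3-4: B leads the new quorum; C keeps its in-memory DB and reports T0.
        int applied = c.syncFrom(b);
        if (applied != 0) throw new IllegalStateException("expected an empty diff");

        // Step 5: C restarts before taking a snapshot and replays T1 from its disk log.
        c.restart();
        return Arrays.asList(a, b, c);
    }

    public static void main(String[] args) {
        for (Server s : runScenario()) {
            System.out.println(s.name + " -> " + s.memoryDb);
        }
    }
}
```

In this model C ends up with /N while A and B do not. The sync in step 4 is driven by the retained in-memory zxid, while the restart in step 5 is driven by the disk log, and the two disagree about T1.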



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
