ZooKeeper_branch35_jdk7 - Build # 1043 - Still Failing
See https://builds.apache.org/job/ZooKeeper_branch35_jdk7/1043/

### ## LAST 60 LINES OF THE CONSOLE ###
[...truncated 63.43 MB...]
    [junit] 2017-07-14 08:51:36,148 [myid:] - WARN  [New I/O boss #5219:ClientCnxnSocketNetty$ZKClientHandler@439] - Exception caught: [id: 0xb355bd36] EXCEPTION: java.net.ConnectException: Connection refused: 127.0.0.1/127.0.0.1:30196
    [junit] java.net.ConnectException: Connection refused: 127.0.0.1/127.0.0.1:30196
    [junit]         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    [junit]         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
    [junit]         at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    [junit]         at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    [junit]         at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    [junit]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    [junit]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    [junit]         at java.lang.Thread.run(Thread.java:745)
    [junit] 2017-07-14 08:51:36,149 [myid:] - INFO  [New I/O boss #5219:ClientCnxnSocketNetty@208] - channel is told closing
    [junit] 2017-07-14 08:51:36,149 [myid:127.0.0.1:30196] - INFO  [main-SendThread(127.0.0.1:30196):ClientCnxn$SendThread@1231] - channel for sessionid 0x104b9b9dbbd is lost, closing socket connection and attempting reconnect
    [junit] 2017-07-14 08:51:36,488 [myid:127.0.0.1:30199] - INFO  [main-SendThread(127.0.0.1:30199):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30199. Will not attempt to authenticate using SASL (unknown error)
    [junit] 2017-07-14 08:51:36,489 [myid:] - INFO  [New I/O boss #5236:ClientCnxnSocketNetty$1@127] - future isn't success, cause: {}
    [junit] java.net.ConnectException: Connection refused: 127.0.0.1/127.0.0.1:30199
    [junit]         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    [junit]         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
    [junit]         at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    [junit]         at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    [junit]         at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    [junit]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    [junit]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    [junit]         at java.lang.Thread.run(Thread.java:745)
    [junit] 2017-07-14 08:51:36,489 [myid:] - WARN  [New I/O boss #5236:ClientCnxnSocketNetty$ZKClientHandler@439] - Exception caught: [id: 0x3660a4f0] EXCEPTION: java.net.ConnectException: Connection refused: 127.0.0.1/127.0.0.1:30199
    [junit] java.net.ConnectException: Connection refused: 127.0.0.1/127.0.0.1:30199
    [junit]         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    [junit]         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
    [junit]         at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    [junit]         at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    [junit]         at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    [junit]         at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    [junit]         at
[GitHub] zookeeper pull request #279: ZOOKEEPER-2804:Node creation fails with NPE if ...
Github user jainbhupendra24 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/279#discussion_r127447190

    --- Diff: src/java/main/org/apache/zookeeper/ZooKeeper.java ---
    @@ -1436,7 +1436,7 @@ public String create(final String path, byte data[], List acl,
             request.setData(data);
             request.setFlags(createMode.toFlag());
             request.setPath(serverPath);
    -        if (acl != null && acl.size() == 0) {
    +        if (acl == null || acl.isEmpty() || acl.contains(null)) {
    --- End diff --

    Throwing IllegalArgumentException is the correct way, but it may break existing behavior. Consider application code like:

        zk.create(callBackHandler);  // call to create API, passing a callback handler
        doSomething();               // some more stuff

    At present, if the client passes a null ACL parameter, the error is handled by callBackHandler and doSomething() in the application code still executes. But if we throw IllegalArgumentException, doSomething() will not execute, so the existing behavior of application code changes. Please suggest whether we really need to handle this case.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Commented] (ZOOKEEPER-2804) Node creation fails with NPE if ACLs are null
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087254#comment-16087254 ]

ASF GitHub Bot commented on ZOOKEEPER-2804:
-------------------------------------------

Github user jainbhupendra24 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/279#discussion_r127447190

    --- Diff: src/java/main/org/apache/zookeeper/ZooKeeper.java ---
    @@ -1436,7 +1436,7 @@ public String create(final String path, byte data[], List acl,
             request.setData(data);
             request.setFlags(createMode.toFlag());
             request.setPath(serverPath);
    -        if (acl != null && acl.size() == 0) {
    +        if (acl == null || acl.isEmpty() || acl.contains(null)) {
    --- End diff --

    Throwing IllegalArgumentException is the correct way, but it may break existing behavior. Consider application code like:

        zk.create(callBackHandler);  // call to create API, passing a callback handler
        doSomething();               // some more stuff

    At present, if the client passes a null ACL parameter, the error is handled by callBackHandler and doSomething() in the application code still executes. But if we throw IllegalArgumentException, doSomething() will not execute, so the existing behavior of application code changes. Please suggest whether we really need to handle this case.
> Node creation fails with NPE if ACLs are null
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-2804
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2804
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Bhupendra Kumar Jain
>
> If null ACLs are passed then zk node creation or set ACL fails with NPE
> {code}
> java.lang.NullPointerException
>         at org.apache.zookeeper.server.PrepRequestProcessor.removeDuplicates(PrepRequestProcessor.java:1301)
>         at org.apache.zookeeper.server.PrepRequestProcessor.fixupACL(PrepRequestProcessor.java:1341)
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:519)
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:1126)
>         at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:178)
> {code}
> Expected to handle null in server and return proper error code to client

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
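For illustration, the guard debated in the thread above can be exercised in isolation. The following is a hypothetical standalone sketch, not ZooKeeper's actual code: the class name `AclValidation` is invented, and plain strings stand in for `org.apache.zookeeper.data.ACL` entries. It shows the exact condition of the proposed check, `acl == null || acl.isEmpty() || acl.contains(null)`, which rejects all three inputs that would otherwise reach the server and trigger the NPE in `PrepRequestProcessor`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the client-side ACL guard discussed above.
// Strings stand in for org.apache.zookeeper.data.ACL entries.
public class AclValidation {

    // True when the ACL list would trigger the server-side NPE:
    // a null list, an empty list, or a null element inside the list.
    static boolean isInvalidAcl(List<?> acl) {
        return acl == null || acl.isEmpty() || acl.contains(null);
    }

    public static void main(String[] args) {
        System.out.println(isInvalidAcl(null));                          // true
        System.out.println(isInvalidAcl(Collections.emptyList()));       // true
        System.out.println(isInvalidAcl(Arrays.asList("world:anyone"))); // false
        System.out.println(isInvalidAcl(Arrays.asList("a", null)));      // true
    }
}
```

Note the compatibility point raised in the review: rejecting these inputs by throwing synchronously changes how async callers observe the error; delivering an InvalidACL error code through the callback instead would preserve the existing control flow.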
ZooKeeper_branch34_openjdk7 - Build # 1575 - Still Failing
See https://builds.apache.org/job/ZooKeeper_branch34_openjdk7/1575/

### ## LAST 60 LINES OF THE CONSOLE ###
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on qnode1 (ubuntu) in workspace /home/jenkins/jenkins-slave/workspace/ZooKeeper_branch34_openjdk7
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url git://git.apache.org/zookeeper.git # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Fetching upstream changes from git://git.apache.org/zookeeper.git
 > git --version # timeout=10
 > git fetch --tags --progress git://git.apache.org/zookeeper.git +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/branch-3.4^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/branch-3.4^{commit} # timeout=10
Checking out Revision 3872e49f6fa6835e0fca2229bec41d1b3cad4553 (refs/remotes/origin/branch-3.4)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 3872e49f6fa6835e0fca2229bec41d1b3cad4553
 > git rev-list 3872e49f6fa6835e0fca2229bec41d1b3cad4553 # timeout=10
No emails were triggered.
[ZooKeeper_branch34_openjdk7] $ /home/jenkins/tools/ant/apache-ant-1.9.9/bin/ant -Dtest.output=yes -Dtest.junit.threads=8 -Dtest.junit.output.format=xml -Djavac.target=1.7 clean test-core-java
Error: JAVA_HOME is not defined correctly.
  We cannot execute /usr/lib/jvm/java-7-openjdk-amd64//bin/java
Build step 'Invoke Ant' marked build as failure
Recording test results
ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any

### ## FAILED TESTS (if any) ##
No tests ran.
ZooKeeper-trunk-jdk8 - Build # 1128 - Still Failing
See https://builds.apache.org/job/ZooKeeper-trunk-jdk8/1128/

### ## LAST 60 LINES OF THE CONSOLE ###
[...truncated 65.86 MB...]
    [junit] java.net.ConnectException: Connection refused
    [junit]         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    [junit]         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    [junit]         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357)
    [junit]         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214)
    [junit] 2017-07-14 12:01:14,625 [myid:127.0.0.1:30073] - INFO  [main-SendThread(127.0.0.1:30073):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30073. Will not attempt to authenticate using SASL (unknown error)
    [junit] 2017-07-14 12:01:14,626 [myid:127.0.0.1:30073] - WARN  [main-SendThread(127.0.0.1:30073):ClientCnxn$SendThread@1235] - Session 0xbef47698 for server 127.0.0.1/127.0.0.1:30073, unexpected error, closing socket connection and attempting reconnect
    [junit] java.net.ConnectException: Connection refused
    [junit]         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    [junit]         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    [junit]         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357)
    [junit]         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214)
    [junit] 2017-07-14 12:01:15,081 [myid:127.0.0.1:30076] - INFO  [main-SendThread(127.0.0.1:30076):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30076. Will not attempt to authenticate using SASL (unknown error)
    [junit] 2017-07-14 12:01:15,081 [myid:127.0.0.1:30076] - WARN  [main-SendThread(127.0.0.1:30076):ClientCnxn$SendThread@1235] - Session 0x100bef47953 for server 127.0.0.1/127.0.0.1:30076, unexpected error, closing socket connection and attempting reconnect
    [junit] java.net.ConnectException: Connection refused
    [junit]         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    [junit]         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    [junit]         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357)
    [junit]         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214)
    [junit] 2017-07-14 12:01:15,222 [myid:127.0.0.1:30073] - INFO  [main-SendThread(127.0.0.1:30073):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30073. Will not attempt to authenticate using SASL (unknown error)
    [junit] 2017-07-14 12:01:15,222 [myid:127.0.0.1:30073] - WARN  [main-SendThread(127.0.0.1:30073):ClientCnxn$SendThread@1235] - Session 0xbef476980001 for server 127.0.0.1/127.0.0.1:30073, unexpected error, closing socket connection and attempting reconnect
    [junit] java.net.ConnectException: Connection refused
    [junit]         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    [junit]         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    [junit]         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357)
    [junit]         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214)
    [junit] 2017-07-14 12:01:15,463 [myid:127.0.0.1:30076] - INFO  [main-SendThread(127.0.0.1:30076):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30076. Will not attempt to authenticate using SASL (unknown error)
    [junit] 2017-07-14 12:01:15,464 [myid:127.0.0.1:30076] - WARN  [main-SendThread(127.0.0.1:30076):ClientCnxn$SendThread@1235] - Session 0x100bef479530001 for server 127.0.0.1/127.0.0.1:30076, unexpected error, closing socket connection and attempting reconnect
    [junit] java.net.ConnectException: Connection refused
    [junit]         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    [junit]         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    [junit]         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357)
    [junit]         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214)
    [junit] 2017-07-14 12:01:16,116 [myid:127.0.0.1:30073] - INFO  [main-SendThread(127.0.0.1:30073):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30073. Will not attempt to authenticate using SASL (unknown error)
    [junit] 2017-07-14 12:01:16,116 [myid:127.0.0.1:30073] - WARN  [main-SendThread(127.0.0.1:30073):ClientCnxn$SendThread@1235] - Session 0xbef47698 for server 127.0.0.1/127.0.0.1:30073, unexpected error, closing socket connection and attempting reconnect
    [junit] java.net.ConnectException:
[GitHub] zookeeper pull request #310: [ZOOKEEPER-2845][Test] Test used to reproduce t...
Github user lvfangmin commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/310#discussion_r127567852

    --- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
    @@ -784,4 +784,126 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                     maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
         }
     
    +    @Test
    +    public void testTxnAheadSnapInRetainDB() throws Exception {
    +        // 1. start up server and wait for leader election to finish
    +        ClientBase.setupTestEnv();
    +        final int SERVER_COUNT = 3;
    +        final int clientPorts[] = new int[SERVER_COUNT];
    +        StringBuilder sb = new StringBuilder();
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            clientPorts[i] = PortAssignment.unique();
    +            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
    +        }
    +        String quorumCfgSection = sb.toString();
    +
    +        MainThread mt[] = new MainThread[SERVER_COUNT];
    +        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
    +            mt[i].start();
    +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
    +        }
    +
    +        waitForAll(zk, States.CONNECTED);
    +
    +        // we need to shutdown and start back up to make sure that the create session isn't the first transaction since
    +        // that is rather innocuous.
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            mt[i].shutdown();
    +        }
    +
    +        waitForAll(zk, States.CONNECTING);
    +
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            mt[i].start();
    +            // Recreate a client session since the previous session was not persisted.
    +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
    +        }
    +
    +        waitForAll(zk, States.CONNECTED);
    +
    +        // 2. kill all followers
    +        int leader = -1;
    +        Map<Long, Proposal> outstanding = null;
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (mt[i].main.quorumPeer.leader != null) {
    +                leader = i;
    +                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
    +                // increase the tick time to delay the leader going to looking
    +                mt[leader].main.quorumPeer.tickTime = 1;
    +            }
    +        }
    +
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                mt[i].shutdown();
    +            }
    +        }
    +
    +        // 3. start up the followers to form a new quorum
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                mt[i].start();
    +            }
    +        }
    +
    +        // 4. wait one of the followers to be the leader
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                // Recreate a client session since the previous session was not persisted.
    +                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
    +                waitForOne(zk[i], States.CONNECTED);
    +            }
    +        }
    +
    +        // 5. send a create request to leader and make sure it's synced to disk,
    +        //    which means it acked from itself
    +        try {
    +            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
    +                    CreateMode.PERSISTENT);
    +            Assert.fail("create /zk" + leader + " should have failed");
    +        } catch (KeeperException e) {}
    +
    +        // just make sure that we actually did get it in process at the
    +        // leader
    +        Assert.assertTrue(outstanding.size() == 1);
    +        Proposal p = (Proposal) outstanding.values().iterator().next();
    +        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
    +
    +        // make sure it has a chance to write it to disk
    +        Thread.sleep(1000);
    +        p.qvAcksetPairs.get(0).getAckset().contains(leader);
    +
    +        // 6. wait the leader to quit due to not enough followers
    +        waitForOne(zk[leader], States.CONNECTING);
    +
    +        int newLeader = -1;
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (mt[i].main.quorumPeer.leader != null) {
    +                newLeader = i;
    +            }
    +        }
    +        // make sure a different leader was elected
[GitHub] zookeeper pull request #310: [ZOOKEEPER-2845][Test] Test used to reproduce t...
GitHub user lvfangmin opened a pull request:

    https://github.com/apache/zookeeper/pull/310

    [ZOOKEEPER-2845][Test] Test used to reproduce the data inconsistency issue due to retain database in leader election

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lvfangmin/zookeeper ZOOKEEPER-2845-TEST

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/310.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #310

----
commit ff0bc49de51635da1d5bff0e4f260a61acc87db0
Author: Fangmin Lyu
Date:   2017-07-14T23:02:20Z

    reproduce the data inconsistency issue
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088277#comment-16088277 ]

ASF GitHub Bot commented on ZOOKEEPER-2845:
-------------------------------------------

GitHub user lvfangmin opened a pull request:

    https://github.com/apache/zookeeper/pull/310

    [ZOOKEEPER-2845][Test] Test used to reproduce the data inconsistency issue due to retain database in leader election

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lvfangmin/zookeeper ZOOKEEPER-2845-TEST

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/310.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #310

----
commit ff0bc49de51635da1d5bff0e4f260a61acc87db0
Author: Fangmin Lyu
Date:   2017-07-14T23:02:20Z

    reproduce the data inconsistency issue

> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3
>            Reporter: Fangmin Lv
>            Assignee: Fangmin Lv
>            Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time
> during leader election. In a ZooKeeper ensemble, it's possible that the
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or the
> txn file is ahead of the snapshot because no commit message has been received yet.
> If the snapshot is ahead of the txn file, since the SyncRequestProcessor queue will
> be drained during shutdown, the snapshot and txn file will stay consistent
> before leader election happens, so this is not an issue.
> But if the txn file is ahead of the snapshot, it's possible that the ensemble will have
> a data inconsistency issue; here is a simplified scenario to show the issue:
> Let's say we have 3 servers in the ensemble, servers A and B are followers,
> and C is the leader, and all the snapshots and txns are up to T0:
> 1. A new request reached leader C to create Node N, and it was converted to
> txn T1
> 2. Txn T1 was synced to disk on C, but just before the proposal reached the
> followers, A and B restarted, so T1 didn't exist on A and B
> 3. A and B formed a new quorum after restart; let's say B is the leader
> 4. C changed to looking state due to not enough followers; it will sync with
> leader B with last Zxid T0, which will be an empty diff sync
> 5. Before C took a snapshot it restarted; it replayed the txns on disk, which
> include T1, so it now has Node N, but A and B don't have it.
> Also I included a test case to reproduce this issue consistently.
> We have a totally different RetainDB version which will avoid this issue by
> doing consensus between snapshot and txn files before leader election; we will
> submit it for review.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
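The failure precondition in the scenario above is that, at the moment the retained database enters leader election, the txn log holds a zxid newer than the snapshot. A minimal sketch of that check, assuming plain long zxids; the class and method names here are invented for illustration and do not exist in ZooKeeper:

```java
// Illustrative sketch only: detect the "txn log ahead of snapshot" state
// from the scenario above, using plain long zxids.
public class RetainDbCheck {

    // zxids are totally ordered and increase monotonically; a logged txn
    // newer than the snapshot means a restart-time replay could resurrect
    // an uncommitted txn (T1 in the scenario) that the new quorum never saw.
    static boolean txnLogAheadOfSnapshot(long lastSnapshotZxid, long lastLoggedZxid) {
        return lastLoggedZxid > lastSnapshotZxid;
    }

    public static void main(String[] args) {
        long t0 = 0x100000001L; // everything up to T0 is snapshotted everywhere
        long t1 = 0x100000002L; // T1 was synced to disk only on old leader C
        System.out.println(txnLogAheadOfSnapshot(t0, t1)); // true: reconcile before electing
        System.out.println(txnLogAheadOfSnapshot(t1, t1)); // false: safe to retain the DB
    }
}
```

This is also why the fix mentioned at the end of the description reconciles snapshot and txn files before leader election: once the check above is false on every server, an empty diff sync can no longer diverge from what replay would produce.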
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088296#comment-16088296 ]

ASF GitHub Bot commented on ZOOKEEPER-2845:
-------------------------------------------

Github user lvfangmin commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/310#discussion_r127567852

    --- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
    @@ -784,4 +784,126 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                     maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
         }
     
    +    @Test
    +    public void testTxnAheadSnapInRetainDB() throws Exception {
    +        // 1. start up server and wait for leader election to finish
    +        ClientBase.setupTestEnv();
    +        final int SERVER_COUNT = 3;
    +        final int clientPorts[] = new int[SERVER_COUNT];
    +        StringBuilder sb = new StringBuilder();
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            clientPorts[i] = PortAssignment.unique();
    +            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
    +        }
    +        String quorumCfgSection = sb.toString();
    +
    +        MainThread mt[] = new MainThread[SERVER_COUNT];
    +        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
    +            mt[i].start();
    +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
    +        }
    +
    +        waitForAll(zk, States.CONNECTED);
    +
    +        // we need to shutdown and start back up to make sure that the create session isn't the first transaction since
    +        // that is rather innocuous.
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            mt[i].shutdown();
    +        }
    +
    +        waitForAll(zk, States.CONNECTING);
    +
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            mt[i].start();
    +            // Recreate a client session since the previous session was not persisted.
    +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
    +        }
    +
    +        waitForAll(zk, States.CONNECTED);
    +
    +        // 2. kill all followers
    +        int leader = -1;
    +        Map<Long, Proposal> outstanding = null;
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (mt[i].main.quorumPeer.leader != null) {
    +                leader = i;
    +                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
    +                // increase the tick time to delay the leader going to looking
    +                mt[leader].main.quorumPeer.tickTime = 1;
    +            }
    +        }
    +
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                mt[i].shutdown();
    +            }
    +        }
    +
    +        // 3. start up the followers to form a new quorum
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                mt[i].start();
    +            }
    +        }
    +
    +        // 4. wait one of the followers to be the leader
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                // Recreate a client session since the previous session was not persisted.
    +                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
    +                waitForOne(zk[i], States.CONNECTED);
    +            }
    +        }
    +
    +        // 5. send a create request to leader and make sure it's synced to disk,
    +        //    which means it acked from itself
    +        try {
    +            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
    +                    CreateMode.PERSISTENT);
    +            Assert.fail("create /zk" + leader + " should have failed");
    +        } catch (KeeperException e) {}
    +
    +        // just make sure that we actually did get it in process at the
    +        // leader
    +        Assert.assertTrue(outstanding.size() == 1);
    +        Proposal p = (Proposal) outstanding.values().iterator().next();
    +        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
    +
    +        // make sure it has a chance to write it to disk
    +        Thread.sleep(1000);
    +        p.qvAcksetPairs.get(0).getAckset().contains(leader);
    +
    +        // 6. wait the leader to quit due to not enough followers
    +        waitForOne(zk[leader], States.CONNECTING);
    +
    +        int newLeader = -1;
[jira] [Updated] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fangmin Lv updated ZOOKEEPER-2845:
----------------------------------
    Affects Version/s: 3.6.0

> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Fangmin Lv
>            Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time
> during leader election. In a ZooKeeper ensemble, it's possible that the
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or the
> txn file is ahead of the snapshot because no commit message has been received yet.
> If the snapshot is ahead of the txn file, since the SyncRequestProcessor queue will
> be drained during shutdown, the snapshot and txn file will stay consistent
> before leader election happens, so this is not an issue.
> But if the txn file is ahead of the snapshot, it's possible that the ensemble will have
> a data inconsistency issue; here is a simplified scenario to show the issue:
> Let's say we have 3 servers in the ensemble, servers A and B are followers,
> and C is the leader, and all the snapshots and txns are up to T0:
> 1. A new request reached leader C to create Node N, and it was converted to
> txn T1
> 2. Txn T1 was synced to disk on C, but just before the proposal reached the
> followers, A and B restarted, so T1 didn't exist on A and B
> 3. A and B formed a new quorum after restart; let's say B is the leader
> 4. C changed to looking state due to not enough followers; it will sync with
> leader B with last Zxid T0, which will be an empty diff sync
> 5. Before C took a snapshot it restarted; it replayed the txns on disk, which
> include T1, so it now has Node N, but A and B don't have it.
> Also I included a test case to reproduce this issue consistently.
> We have a totally different RetainDB version which will avoid this issue by
> doing consensus between snapshot and txn files before leader election; we will
> submit it for review.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088306#comment-16088306 ]

Fangmin Lv commented on ZOOKEEPER-2845:
---------------------------------------

[~hanm] T1 only exists in the txn file and hasn't been applied to the data tree yet; the lastProcessedZxid on follower C is T0, so there is no TRUNC message when syncing with the leader.

> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Fangmin Lv
>            Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time
> during leader election. In a ZooKeeper ensemble, it's possible that the
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or the
> txn file is ahead of the snapshot because no commit message has been received yet.
> If the snapshot is ahead of the txn file, since the SyncRequestProcessor queue will
> be drained during shutdown, the snapshot and txn file will stay consistent
> before leader election happens, so this is not an issue.
> But if the txn file is ahead of the snapshot, it's possible that the ensemble will have
> a data inconsistency issue; here is a simplified scenario to show the issue:
> Let's say we have 3 servers in the ensemble, servers A and B are followers,
> and C is the leader, and all the snapshots and txns are up to T0:
> 1. A new request reached leader C to create Node N, and it was converted to
> txn T1
> 2. Txn T1 was synced to disk on C, but just before the proposal reached the
> followers, A and B restarted, so T1 didn't exist on A and B
> 3. A and B formed a new quorum after restart; let's say B is the leader
> 4. C changed to looking state due to not enough followers; it will sync with
> leader B with last Zxid T0, which will be an empty diff sync
> 5. Before C took a snapshot it restarted; it replayed the txns on disk, which
> include T1, so it now has Node N, but A and B don't have it.
> Also I included a test case to reproduce this issue consistently.
> We have a totally different RetainDB version which will avoid this issue by
> doing consensus between snapshot and txn files before leader election; we will
> submit it for review.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
Failed: ZOOKEEPER- PreCommit Build #883
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883/

### ## LAST 60 LINES OF THE CONSOLE ###
[...truncated 71.51 MB...]
     [exec] +1 tests included. The patch appears to include 3 new or modified tests.
     [exec]
     [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
     [exec]
     [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
     [exec]
     [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
     [exec]
     [exec] -1 core tests. The patch failed core unit tests.
     [exec]
     [exec] +1 contrib tests. The patch passed contrib unit tests.
     [exec]
     [exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//testReport/
     [exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
     [exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//console
     [exec]
     [exec] This message is automatically generated.
     [exec]
     [exec] ==
     [exec] ==
     [exec] Adding comment to Jira.
     [exec] ==
     [exec] ==
     [exec]
     [exec] Comment added.
     [exec] dea2761bab8c6e73897e813361b13e245383c891 logged out
     [exec]
     [exec] ==
     [exec] ==
     [exec] Finished build.
     [exec] ==
     [exec] ==
     [exec]
     [exec] mv: '/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/patchprocess' and '/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/patchprocess' are the same file

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/build.xml:1642: exec returned: 1

Total time: 12 minutes 44 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Recording test results
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
[description-setter] Description set: ZOOKEEPER-2845
Putting comment on the pull request
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7

### ## FAILED TESTS (if any) ##
3 tests failed.

FAILED:  org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testTxnAheadSnapInRetainDB

Error Message:
server 2 should not have /zk2

Stack Trace:
junit.framework.AssertionFailedError: server 2 should not have /zk2
        at org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testTxnAheadSnapInRetainDB(QuorumPeerMainTest.java:900)
        at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)

FAILED:  org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig

Error Message:
waiting for server 4 being up

Stack Trace:
junit.framework.AssertionFailedError: waiting for server 4 being up
        at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig(ReconfigRecoveryTest.java:223)
        at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)

FAILED:  org.apache.zookeeper.server.quorum.ReconfigDuringLeaderSyncTest.testDuringLeaderSync

Error Message:
zoo.cfg.dynamic.next is not deleted.

Stack Trace:
junit.framework.AssertionFailedError: zoo.cfg.dynamic.next is not deleted.
        at org.apache.zookeeper.server.quorum.ReconfigDuringLeaderSyncTest.testDuringLeaderSync(ReconfigDuringLeaderSyncTest.java:165)
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088293#comment-16088293 ] Hadoop QA commented on ZOOKEEPER-2845: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//console This message is automatically generated. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
Fangmin Lv created ZOOKEEPER-2845: - Summary: Data inconsistency issue due to retain database in leader election Key: ZOOKEEPER-2845 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845 Project: ZooKeeper Issue Type: Bug Components: quorum Affects Versions: 3.5.3, 3.4.10 Reporter: Fangmin Lv Assignee: Fangmin Lv Priority: Critical
In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time during leader election. In a ZooKeeper ensemble, it's possible that the snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or that the txn file is ahead of the snapshot because no commit message has been received yet.
If the snapshot is ahead of the txn file, this is not an issue: the SyncRequestProcessor queue is drained during shutdown, so the snapshot and txn file stay consistent before leader election happens.
But if the txn file is ahead of the snapshot, the ensemble can end up with inconsistent data. Here is a simplified scenario that shows the issue. Say we have 3 servers in the ensemble, A and B are followers, C is the leader, and all snapshots and txns are up to T0:
1. A new request reaches leader C to create node N, and it's converted to txn T1.
2. T1 is synced to disk on C, but A and B restart just before the proposal reaches the followers, so T1 never exists on A and B.
3. A and B form a new quorum after the restart; say B is the leader.
4. C changes to LOOKING state because it no longer has enough followers, and it syncs with leader B with last zxid T0, which results in an empty diff sync.
5. Before C takes a snapshot, it restarts and replays the txns on disk, which include T1. Now C has node N, but A and B don't.
I also included a test case that reproduces this issue consistently.
We have a completely different RetainDB version that avoids this issue by reaching consensus between the snapshot and txn files before leader election; I will submit it for review.
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
ZooKeeper-trunk - Build # 3466 - Failure
See https://builds.apache.org/job/ZooKeeper-trunk/3466/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 64.15 MB...] [junit] 2017-07-14 23:25:39,087 [myid:] - INFO [ProcessThread(sid:0 cport:24933)::PrepRequestProcessor@155] - PrepRequestProcessor exited loop! [junit] 2017-07-14 23:25:39,088 [myid:] - INFO [SyncThread:0:SyncRequestProcessor@169] - SyncRequestProcessor exited! [junit] 2017-07-14 23:25:39,088 [myid:] - INFO [main:FinalRequestProcessor@481] - shutdown of request processor complete [junit] 2017-07-14 23:25:39,088 [myid:] - INFO [main:MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=StandaloneServer_port24933,name1=InMemoryDataTree] [junit] 2017-07-14 23:25:39,089 [myid:] - INFO [main:MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=StandaloneServer_port24933] [junit] 2017-07-14 23:25:39,089 [myid:] - INFO [main:FourLetterWordMain@85] - connecting to 127.0.0.1 24933 [junit] 2017-07-14 23:25:39,090 [myid:] - INFO [main:JMXEnv@146] - ensureOnly:[] [junit] 2017-07-14 23:25:39,090 [myid:127.0.0.1:24810] - INFO [main-SendThread(127.0.0.1:24810):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:24810. 
Will not attempt to authenticate using SASL (unknown error) [junit] 2017-07-14 23:25:39,090 [myid:127.0.0.1:24810] - WARN [main-SendThread(127.0.0.1:24810):ClientCnxn$SendThread@1235] - Session 0x1056d1908e3 for server 127.0.0.1/127.0.0.1:24810, unexpected error, closing socket connection and attempting reconnect [junit] java.net.ConnectException: Connection refused [junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) [junit] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744) [junit] at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357) [junit] at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214) [junit] 2017-07-14 23:25:39,102 [myid:] - INFO [main:ClientBase@626] - fdcount after test is: 7147 at start it was 7147 [junit] 2017-07-14 23:25:39,103 [myid:] - INFO [main:ZKTestCase$1@68] - SUCCEEDED testWatcherAutoResetWithLocal [junit] 2017-07-14 23:25:39,103 [myid:] - INFO [main:ZKTestCase$1@63] - FINISHED testWatcherAutoResetWithLocal [junit] Tests run: 103, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 400.493 sec, Thread: 6, Class: org.apache.zookeeper.test.NioNettySuiteTest [junit] 2017-07-14 23:25:39,210 [myid:127.0.0.1:24816] - INFO [main-SendThread(127.0.0.1:24816):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:24816. 
Will not attempt to authenticate using SASL (unknown error) [junit] 2017-07-14 23:25:39,210 [myid:127.0.0.1:24816] - WARN [main-SendThread(127.0.0.1:24816):ClientCnxn$SendThread@1235] - Session 0x3056d1908e9 for server 127.0.0.1/127.0.0.1:24816, unexpected error, closing socket connection and attempting reconnect [junit] java.net.ConnectException: Connection refused [junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) [junit] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744) [junit] at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357) [junit] at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214) [junit] 2017-07-14 23:25:39,223 [myid:] - INFO [SessionTracker:SessionTrackerImpl@158] - SessionTrackerImpl exited loop! [junit] 2017-07-14 23:25:39,223 [myid:] - INFO [SessionTracker:SessionTrackerImpl@158] - SessionTrackerImpl exited loop! [junit] 2017-07-14 23:25:39,558 [myid:127.0.0.1:24736] - INFO [main-SendThread(127.0.0.1:24736):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:24736. Will not attempt to authenticate using SASL (unknown error) [junit] 2017-07-14 23:25:39,558 [myid:127.0.0.1:24736] - WARN [main-SendThread(127.0.0.1:24736):ClientCnxn$SendThread@1235] - Session 0x1056d1686840001 for server 127.0.0.1/127.0.0.1:24736, unexpected error, closing socket connection and attempting reconnect [junit] java.net.ConnectException: Connection refused [junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) [junit] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744) [junit] at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357) [junit] at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214) fail.build.on.test.failure: BUILD FAILED /home/jenkins/jenkins-slave/workspace/ZooKeeper-trunk/build.xml:1338: The following error occurred while executing this line:
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088299#comment-16088299 ] Michael Han commented on ZOOKEEPER-2845: Thanks for reporting this issue [~lvfangmin]. bq. C changed to looking state due to no enough followers, it will sync with leader B with last Zxid T0, which will have an empty diff sync
Are you saying leader B is sending a DIFF to follower C in this case? Since B does not have T1, I think it should send a TRUNC and C should drop T1 from its txn log. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088464#comment-16088464 ] Michael Han commented on ZOOKEEPER-2845: Makes sense to me. I think we didn't have this issue previously because the {{zkDb}} was cleared across leader election; if we restarted C, it would recover from both the snap and the txn log, so it would find that its {{lastProcessedZxid}} is T1 rather than T0, which would yield a TRUNC instead of a DIFF from leader B. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088464#comment-16088464 ] Michael Han edited comment on ZOOKEEPER-2845 at 7/15/17 5:21 AM: - Makes sense to me. I think we didn't have this issue previously because the {{zkDb}} was cleared across leader election, and C would find that its {{lastProcessedZxid}} is T1 rather than T0 while reinitializing {{zkDb}} from the snap/txn log, which would yield a TRUNC instead of a DIFF from leader B. was (Author: hanm): Make sense to me. I think previously we don't have this issue because the {{zkDb}} was cleared across leader election, and if we restarting C it will recover from both the snap and the tnx log so it will find out its {{lastProcessedZxid}} is T1, rather than T0, which will yield a TRUNC instead of DIFF from leader B. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
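The contrast Michael Han draws — a retained {{zkDb}} reports the in-memory tree's zxid (T0), while the old clear-and-reload path replays the txn log and therefore reports T1 — can be sketched as follows. All names here are hypothetical and chosen for the illustration; this is not ZooKeeper's actual API:

```java
// Hypothetical sketch of how the zxid a server reports at election time
// differs between the retained-database path (ZOOKEEPER-2678) and the old
// clear-and-reload path. Not ZooKeeper's real code.
public final class LastZxidSketch {
    public static long reportedZxid(boolean retainDb,
                                    long inMemoryLastProcessedZxid,
                                    long snapshotZxid,
                                    long[] txnLogZxids) {
        if (retainDb) {
            // Retained-DB path: reuse the in-memory tree as-is. A txn that
            // was fsynced but never applied (T1) is invisible here.
            return inMemoryLastProcessedZxid;
        }
        // Old path: restore from the snapshot, then replay every txn on
        // disk, so the fsynced-but-unapplied T1 is picked up.
        long last = snapshotZxid;
        for (long zxid : txnLogZxids) {
            if (zxid > last) {
                last = zxid;
            }
        }
        return last;
    }
}
```

With T0 = 1 and T1 = 2 both on disk, the retained path reports 1 (leading B to send an empty DIFF, so T1 lingers in C's log), while the reload path reports 2 (leading B to send a TRUNC that removes T1).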
ZooKeeper-trunk-WinVS2008 - Build # 2471 - Still Failing
See https://builds.apache.org/job/ZooKeeper-trunk-WinVS2008/2471/ ### ## LAST 60 LINES OF THE CONSOLE ### Started by timer No JDK named ‘JDK 1.8 (unlimited security) 64-bit Windows only’ found [EnvInject] - Loading node environment variables. No JDK named ‘JDK 1.8 (unlimited security) 64-bit Windows only’ found Building remotely on windows-2012-2 (Windows) in workspace f:\jenkins\jenkins-slave\workspace\ZooKeeper-trunk-WinVS2008 No JDK named ‘JDK 1.8 (unlimited security) 64-bit Windows only’ found No JDK named ‘JDK 1.8 (unlimited security) 64-bit Windows only’ found > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url git://git.apache.org/zookeeper.git # timeout=10 Cleaning workspace > git rev-parse --verify HEAD # timeout=10 Resetting working tree > git reset --hard # timeout=10 > git clean -fdx # timeout=10 Fetching upstream changes from git://git.apache.org/zookeeper.git > git --version # timeout=10 > git fetch --tags --progress git://git.apache.org/zookeeper.git > +refs/heads/*:refs/remotes/origin/* > git rev-parse "refs/remotes/origin/master^{commit}" # timeout=10 > git rev-parse "refs/remotes/origin/origin/master^{commit}" # timeout=10 No JDK named ‘JDK 1.8 (unlimited security) 64-bit Windows only’ found Checking out Revision ddf0364903bf7ac7cd25b2e1927f0d9d3c7203c4 (refs/remotes/origin/master) > git config core.sparsecheckout # timeout=10 > git checkout -f ddf0364903bf7ac7cd25b2e1927f0d9d3c7203c4 > git rev-list ddf0364903bf7ac7cd25b2e1927f0d9d3c7203c4 # timeout=10 No emails were triggered. 
No JDK named ‘JDK 1.8 (unlimited security) 64-bit Windows only’ found [ZooKeeper-trunk-WinVS2008] $ cmd.exe /C "F:\jenkins\tools\ant\latest\bin\ant.bat -Dtest.output=yes -Dtest.junit.output.format=xml clean compile_jute && exit %%ERRORLEVEL%%" java.lang.UnsupportedClassVersionError: org/apache/tools/ant/launch/Launcher : Unsupported major.minor version 52.0 at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482) Exception in thread "main" Build step 'Invoke Ant' marked build as failure No JDK named ‘JDK 1.8 (unlimited security) 64-bit Windows only’ found Email was triggered for: Failure - Any Sending email for trigger: Failure - Any No JDK named ‘JDK 1.8 (unlimited security) 64-bit Windows only’ found No JDK named ‘JDK 1.8 (unlimited security) 64-bit Windows only’ found ### ## FAILED TESTS (if any) ## No tests ran.
ZooKeeper_branch35_jdk8 - Build # 600 - Still Failing
See https://builds.apache.org/job/ZooKeeper_branch35_jdk8/600/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 66.81 MB...] [junit] 2017-07-14 12:27:13,045 [myid:] - INFO [SyncThread:0:MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=StandaloneServer_port16854,name1=Connections,name2=127.0.0.1,name3=0x10686e04e06] [junit] 2017-07-14 12:27:13,046 [myid:] - INFO [main-EventThread:ClientCnxn$EventThread@513] - EventThread shut down for session: 0x10686e04e06 [junit] 2017-07-14 12:27:13,046 [myid:] - INFO [main:ZooKeeper@1331] - Session: 0x10686e04e06 closed [junit] 2017-07-14 12:27:13,046 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@82] - Memory used 100505 [junit] 2017-07-14 12:27:13,046 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@87] - Number of threads 471 [junit] 2017-07-14 12:27:13,046 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@102] - FINISHED TEST METHOD testWatcherAutoResetWithLocal [junit] 2017-07-14 12:27:13,046 [myid:] - INFO [main:ClientBase@586] - tearDown starting [junit] 2017-07-14 12:27:13,046 [myid:] - INFO [main:ClientBase@556] - STOPPING server [junit] 2017-07-14 12:27:13,047 [myid:] - INFO [main:NettyServerCnxnFactory@464] - shutdown called 0.0.0.0/0.0.0.0:16854 [junit] 2017-07-14 12:27:13,048 [myid:] - INFO [main:ZooKeeperServer@541] - shutting down [junit] 2017-07-14 12:27:13,048 [myid:] - ERROR [main:ZooKeeperServer@505] - ZKShutdownHandler is not registered, so ZooKeeper server won't take any action on ERROR or SHUTDOWN server state changes [junit] 2017-07-14 12:27:13,048 [myid:] - INFO [main:SessionTrackerImpl@232] - Shutting down [junit] 2017-07-14 12:27:13,048 [myid:] - INFO [main:PrepRequestProcessor@1005] - Shutting down [junit] 2017-07-14 12:27:13,048 [myid:] - INFO [main:SyncRequestProcessor@191] - Shutting down [junit] 2017-07-14 12:27:13,048 [myid:] - INFO [ProcessThread(sid:0 cport:16854)::PrepRequestProcessor@155] - PrepRequestProcessor exited loop! 
[junit] 2017-07-14 12:27:13,048 [myid:] - INFO [SyncThread:0:SyncRequestProcessor@169] - SyncRequestProcessor exited! [junit] 2017-07-14 12:27:13,048 [myid:] - INFO [main:FinalRequestProcessor@481] - shutdown of request processor complete [junit] 2017-07-14 12:27:13,048 [myid:] - INFO [main:MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=StandaloneServer_port16854,name1=InMemoryDataTree] [junit] 2017-07-14 12:27:13,049 [myid:] - INFO [main:MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=StandaloneServer_port16854] [junit] 2017-07-14 12:27:13,049 [myid:] - INFO [main:FourLetterWordMain@85] - connecting to 127.0.0.1 16854 [junit] 2017-07-14 12:27:13,049 [myid:] - INFO [main:JMXEnv@146] - ensureOnly:[] [junit] 2017-07-14 12:27:13,053 [myid:] - INFO [main:ClientBase@611] - fdcount after test is: 1395 at start it was 1391 [junit] 2017-07-14 12:27:13,053 [myid:] - INFO [main:ClientBase@613] - sleeping for 20 secs [junit] 2017-07-14 12:27:13,053 [myid:] - INFO [main:ZKTestCase$1@68] - SUCCEEDED testWatcherAutoResetWithLocal [junit] 2017-07-14 12:27:13,053 [myid:] - INFO [main:ZKTestCase$1@63] - FINISHED testWatcherAutoResetWithLocal [junit] Tests run: 103, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 427.355 sec, Thread: 3, Class: org.apache.zookeeper.test.NioNettySuiteTest [junit] 2017-07-14 12:27:13,293 [myid:127.0.0.1:16608] - INFO [main-SendThread(127.0.0.1:16608):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:16608. 
Will not attempt to authenticate using SASL (unknown error) [junit] 2017-07-14 12:27:13,295 [myid:127.0.0.1:16608] - WARN [main-SendThread(127.0.0.1:16608):ClientCnxn$SendThread@1235] - Session 0x10686d9e18e for server 127.0.0.1/127.0.0.1:16608, unexpected error, closing socket connection and attempting reconnect [junit] java.net.ConnectException: Connection refused [junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) [junit] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) [junit] at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357) [junit] at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214) [junit] 2017-07-14 12:27:13,298 [myid:127.0.0.1:16731] - INFO [main-SendThread(127.0.0.1:16731):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:16731. Will not attempt to authenticate using SASL (unknown error) [junit] 2017-07-14 12:27:13,299 [myid:127.0.0.1:16731] - WARN [main-SendThread(127.0.0.1:16731):ClientCnxn$SendThread@1235] - Session 0x10686dd157e for
ZooKeeper_branch35_openjdk7 - Build # 600 - Still Failing
See https://builds.apache.org/job/ZooKeeper_branch35_openjdk7/600/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 64.07 MB...] [junit] 2017-07-14 10:11:08,692 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1229] - Client session timed out, have not heard from server in 30022ms for sessionid 0x0, closing socket connection and attempting reconnect [junit] 2017-07-14 10:11:10,054 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30079. Will not attempt to authenticate using SASL (unknown error) [junit] 2017-07-14 10:11:10,056 [myid:2] - INFO [NIOServerCxnFactory.AcceptThread:localhost/127.0.0.1:30079:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:59589 [junit] 2017-07-14 10:11:10,056 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@946] - Socket connection established, initiating session, client: /127.0.0.1:59589, server: 127.0.0.1/127.0.0.1:30079 [junit] 2017-07-14 10:11:28,766 [myid:127.0.0.1:30079] - WARN [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1181] - Client session timed out, have not heard from server in 30031ms for sessionid 0x2022f857561 [junit] 2017-07-14 10:11:28,767 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1229] - Client session timed out, have not heard from server in 30031ms for sessionid 0x2022f857561, closing socket connection and attempting reconnect [junit] 2017-07-14 10:11:30,408 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30079. 
Will not attempt to authenticate using SASL (unknown error) [junit] 2017-07-14 10:11:30,409 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@946] - Socket connection established, initiating session, client: /127.0.0.1:59594, server: 127.0.0.1/127.0.0.1:30079 [junit] 2017-07-14 10:11:30,410 [myid:2] - INFO [NIOServerCxnFactory.AcceptThread:localhost/127.0.0.1:30079:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:59594 [junit] 2017-07-14 10:11:40,073 [myid:127.0.0.1:30079] - WARN [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1181] - Client session timed out, have not heard from server in 30017ms for sessionid 0x0 [junit] 2017-07-14 10:11:40,073 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1229] - Client session timed out, have not heard from server in 30017ms for sessionid 0x0, closing socket connection and attempting reconnect [junit] 2017-07-14 10:11:41,596 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30079. 
Will not attempt to authenticate using SASL (unknown error) [junit] 2017-07-14 10:11:41,597 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@946] - Socket connection established, initiating session, client: /127.0.0.1:59599, server: 127.0.0.1/127.0.0.1:30079 [junit] 2017-07-14 10:11:41,598 [myid:2] - INFO [NIOServerCxnFactory.AcceptThread:localhost/127.0.0.1:30079:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:59599 [junit] 2017-07-14 10:12:00,439 [myid:127.0.0.1:30079] - WARN [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1181] - Client session timed out, have not heard from server in 30030ms for sessionid 0x2022f857561 [junit] 2017-07-14 10:12:00,440 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1229] - Client session timed out, have not heard from server in 30030ms for sessionid 0x2022f857561, closing socket connection and attempting reconnect [junit] 2017-07-14 10:12:01,675 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1113] - Opening socket connection to server 127.0.0.1/127.0.0.1:30079. Will not attempt to authenticate using SASL (unknown error) [junit] 2017-07-14 10:12:01,676 [myid:127.0.0.1:30079] - INFO [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@946] - Socket connection established, initiating session, client: /127.0.0.1:59605, server: 127.0.0.1/127.0.0.1:30079 [junit] 2017-07-14 10:12:01,677 [myid:2] - INFO [NIOServerCxnFactory.AcceptThread:localhost/127.0.0.1:30079:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:59605 [junit] 2017-07-14 10:12:11,616 [myid:127.0.0.1:30079] - WARN [main-SendThread(127.0.0.1:30079):ClientCnxn$SendThread@1181] -