ZooKeeper_branch34_jdk7 - Build # 897 - Still Failing
See https://builds.apache.org/job/ZooKeeper_branch34_jdk7/897/

### LAST 60 LINES OF THE CONSOLE ###

[...truncated 213627 lines...]
[junit] 2015-05-25 09:36:28,732 [myid:] - INFO [main:JMXEnv@246] - expect:StandaloneServer_port
[junit] 2015-05-25 09:36:28,732 [myid:] - INFO [main:JMXEnv@250] - found:StandaloneServer_port org.apache.ZooKeeperService:name0=StandaloneServer_port11221
[junit] 2015-05-25 09:36:28,732 [myid:] - INFO [main:ClientBase@490] - STOPPING server
[junit] 2015-05-25 09:36:28,733 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory@224] - NIOServerCnxn factory exited run method
[junit] 2015-05-25 09:36:28,733 [myid:] - INFO [main:ZooKeeperServer@441] - shutting down
[junit] 2015-05-25 09:36:28,733 [myid:] - INFO [main:SessionTrackerImpl@225] - Shutting down
[junit] 2015-05-25 09:36:28,733 [myid:] - INFO [main:PrepRequestProcessor@760] - Shutting down
[junit] 2015-05-25 09:36:28,733 [myid:] - INFO [main:SyncRequestProcessor@209] - Shutting down
[junit] 2015-05-25 09:36:28,733 [myid:] - INFO [SyncThread:0:SyncRequestProcessor@187] - SyncRequestProcessor exited!
[junit] 2015-05-25 09:36:28,734 [myid:] - INFO [main:FinalRequestProcessor@415] - shutdown of request processor complete
[junit] 2015-05-25 09:36:28,733 [myid:] - INFO [ProcessThread(sid:0 cport:11221)::PrepRequestProcessor@143] - PrepRequestProcessor exited loop!
[junit] 2015-05-25 09:36:28,734 [myid:] - INFO [main:FourLetterWordMain@43] - connecting to 127.0.0.1 11221
[junit] 2015-05-25 09:36:28,734 [myid:] - INFO [main:JMXEnv@146] - ensureOnly:[]
[junit] 2015-05-25 09:36:28,736 [myid:] - INFO [main:ClientBase@443] - STARTING server
[junit] 2015-05-25 09:36:28,736 [myid:] - INFO [main:ClientBase@364] - CREATING server instance 127.0.0.1:11221
[junit] 2015-05-25 09:36:28,736 [myid:] - INFO [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:11221
[junit] 2015-05-25 09:36:28,736 [myid:] - INFO [main:ClientBase@339] - STARTING server instance 127.0.0.1:11221
[junit] 2015-05-25 09:36:28,737 [myid:] - INFO [main:ZooKeeperServer@162] - Created server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 6 datadir /home/jenkins/jenkins-slave/workspace/ZooKeeper_branch34_jdk7/branch-3.4/build/test/tmp/test7109069252678877654.junit.dir/version-2 snapdir /home/jenkins/jenkins-slave/workspace/ZooKeeper_branch34_jdk7/branch-3.4/build/test/tmp/test7109069252678877654.junit.dir/version-2
[junit] 2015-05-25 09:36:28,739 [myid:] - INFO [main:FourLetterWordMain@43] - connecting to 127.0.0.1 11221
[junit] 2015-05-25 09:36:28,740 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:50004
[junit] 2015-05-25 09:36:28,740 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@827] - Processing stat command from /127.0.0.1:50004
[junit] 2015-05-25 09:36:28,740 [myid:] - INFO [Thread-4:NIOServerCnxn$StatCommand@663] - Stat command output
[junit] 2015-05-25 09:36:28,741 [myid:] - INFO [Thread-4:NIOServerCnxn@1007] - Closed socket connection for client /127.0.0.1:50004 (no session established for client)
[junit] 2015-05-25 09:36:28,741 [myid:] - INFO [main:JMXEnv@229] - ensureParent:[InMemoryDataTree, StandaloneServer_port]
[junit] 2015-05-25 09:36:28,742 [myid:] - INFO [main:JMXEnv@246] - expect:InMemoryDataTree
[junit] 2015-05-25 09:36:28,743 [myid:] - INFO [main:JMXEnv@250] - found:InMemoryDataTree org.apache.ZooKeeperService:name0=StandaloneServer_port11221,name1=InMemoryDataTree
[junit] 2015-05-25 09:36:28,743 [myid:] - INFO [main:JMXEnv@246] - expect:StandaloneServer_port
[junit] 2015-05-25 09:36:28,743 [myid:] - INFO [main:JMXEnv@250] - found:StandaloneServer_port org.apache.ZooKeeperService:name0=StandaloneServer_port11221
[junit] 2015-05-25 09:36:28,743 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@58] - Memory used 10471
[junit] 2015-05-25 09:36:28,743 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@63] - Number of threads 20
[junit] 2015-05-25 09:36:28,743 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@78] - FINISHED TEST METHOD testQuota
[junit] 2015-05-25 09:36:28,744 [myid:] - INFO [main:ClientBase@520] - tearDown starting
[junit] 2015-05-25 09:36:28,821 [myid:] - INFO [main:ZooKeeper@684] - Session: 0x14d8a6e37b9 closed
[junit] 2015-05-25 09:36:28,822 [myid:] - INFO [main:ClientBase@490] - STOPPING server
[junit] 2015-05-25 09:36:28,822 [myid:] - INFO [main-EventThread:ClientCnxn$EventThread@512] - EventThread shut down for session: 0x14d8a6e37b9
[junit] 2015-05-25 09:36:28,822 [myid:] - INFO
ZooKeeper-trunk-solaris - Build # 1042 - Still Failing
See https://builds.apache.org/job/ZooKeeper-trunk-solaris/1042/

### LAST 60 LINES OF THE CONSOLE ###

Started by timer
Building remotely on solaris1 (Solaris) in workspace /export/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris
Updating http://svn.apache.org/repos/asf/zookeeper/trunk at revision '2015-05-25T08:32:18.579 +'
At revision 1681563
Updating http://svn.apache.org/repos/asf/hadoop/nightly at revision '2015-05-25T08:32:18.579 +'
At revision 1681563
no change for http://svn.apache.org/repos/asf/zookeeper/trunk since the previous build
no change for http://svn.apache.org/repos/asf/hadoop/nightly since the previous build
No emails were triggered.
[locks-and-latches] Checking to see if we really have the locks
[locks-and-latches] Have all the locks, build can start
[ZooKeeper-trunk-solaris] $ /bin/bash /var/tmp/hudson4219020641947618107.sh
/var/tmp/hudson4219020641947618107.sh: line 12: ant: command not found
Build step 'Execute shell' marked build as failure
[locks-and-latches] Releasing all the locks
[locks-and-latches] All the locks released
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure

### FAILED TESTS (if any) ###

No tests ran.
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558059#comment-14558059 ]

Michi Mutsuzaki commented on ZOOKEEPER-2172:
--------------------------------------------

Thanks Ziyou. [~fpj] [~shralex], it looks like node1 and node2 are not forming a quorum because node2 has seen zxid 0x10059 but node1 keeps sending 0x0 as its zxid. Isn't node1 supposed to send the highest zxid it has seen?

From zookeeper-1.log:

{noformat}
2015-05-25 12:34:36,920 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=2 zxid=0x0 leader=1 config version = 10049
2015-05-25 12:34:39,090 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=3 zxid=0x0 leader=1 config version = 10049
2015-05-25 12:35:28,128 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=2 zxid=0x0 leader=1 config version = 10049
2015-05-25 12:35:30,301 [myid:1] - DEBUG [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@423] - Sending new notification. My id =1 recipient=3 zxid=0x0 leader=1 config version = 10049
{noformat}

From zookeeper-2.log:

{noformat}
2015-05-25 12:34:36,918 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message format version), 2 (n.leader), 0x10059 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)10049 (n.config version)
2015-05-25 12:34:36,923 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x (n.round), LEADING (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)10049 (n.config version)
2015-05-25 12:35:28,124 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.0.0.2:1300:FastLeaderElection@688] - Sending Notification: 2 (n.leader), 0x10059 (n.zxid), 0x1 (n.round), 1 (recipient), 2 (myid), 0x1 (n.peerEpoch)
2015-05-25 12:35:28,125 [myid:2] - DEBUG [QuorumPeer[myid=2]/10.0.0.2:1300:FastLeaderElection@688] - Sending Notification: 2 (n.leader), 0x10059 (n.zxid), 0x1 (n.round), 2 (recipient), 2 (myid), 0x1 (n.peerEpoch)
{noformat}

Cluster crashes when reconfig a new node as a participant
---------------------------------------------------------

                 Key: ZOOKEEPER-2172
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection, quorum, server
    Affects Versions: 3.5.0
         Environment: Ubuntu 12.04 + java 7
            Reporter: Ziyou Wang
            Priority: Critical
         Attachments: node-1.log, node-2.log, node-3.log, zoo.cfg.dynamic.1005d, zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-2.log, zookeeper-3.log

The operations are quite simple: start three zk servers one by one, then reconfig the cluster to add the new one as a participant. When I add the third one, the zk cluster may enter a weird state and cannot recover. I found "2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig" in the node-1 log, so the first node received the reconfig cmd at 12:53:48. Later, it logged "2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open" and "2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - *** GOODBYE /10.0.0.2:55890 ". From then on, the first and second nodes rejected all client connections and the third node didn't join the cluster as a participant. The whole cluster was done.

When the problem happened, all three nodes used the same dynamic config file zoo.cfg.dynamic.1005d, which only contained the first two nodes, but there was another unused dynamic config file in the node-1 directory, zoo.cfg.dynamic.next, which already contained three nodes. When I extended the waiting time between starting the third node and reconfiguring the cluster, the problem didn't show again, so it should be a race condition problem.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
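A note for readers following Michi's analysis: a zxid of 0x0 matters because of the vote-ordering rule in leader election. The sketch below is a simplified, illustrative version of that comparison (the real FastLeaderElection.totalOrderPredicate also consults quorum weights) and is not the ZooKeeper source; it shows why node2, having seen 0x10059, would not adopt a vote advertising zxid 0x0 at the same epoch.

```java
// Simplified sketch of ZooKeeper's vote-ordering rule: a candidate vote wins
// only if it has a higher epoch, or the same epoch and a higher zxid, or the
// same epoch and zxid and a higher server id.
public class VoteOrder {
    static boolean newVoteWins(long newEpoch, long newZxid, long newId,
                               long curEpoch, long curZxid, long curId) {
        if (newEpoch != curEpoch) return newEpoch > curEpoch;
        if (newZxid != curZxid) return newZxid > curZxid;
        return newId > curId;
    }

    public static void main(String[] args) {
        // node1 (id=1) advertises zxid 0x0; node2 (id=2) has seen 0x10059 at the
        // same epoch, so node1's vote loses and node2 keeps LOOKING.
        System.out.println(newVoteWins(0x1, 0x0, 1, 0x1, 0x10059, 2)); // prints false
    }
}
```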
ZooKeeper-trunk-openjdk7 - Build # 821 - Still Failing
See https://builds.apache.org/job/ZooKeeper-trunk-openjdk7/821/

### LAST 60 LINES OF THE CONSOLE ###

[...truncated 369784 lines...]
[junit] java.lang.InterruptedException
[junit] at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
[junit] at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
[junit] at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:877)
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:598)
[junit] 2015-05-25 19:56:44,616 [myid:] - WARN [LearnerHandler-/127.0.0.1:52700:LearnerHandler@879] - Ignoring unexpected exception
[junit] java.lang.InterruptedException
[junit] at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
[junit] at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
[junit] at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:877)
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:598)
[junit] 2015-05-25 19:56:44,617 [myid:] - INFO [ConnnectionExpirer:NIOServerCnxnFactory$ConnectionExpirerThread@583] - ConnnectionExpirerThread interrupted
[junit] 2015-05-25 19:56:44,618 [myid:] - INFO [NIOServerCxnFactory.SelectorThread-0:NIOServerCnxnFactory$SelectorThread@420] - selector thread exitted run method
[junit] 2015-05-25 19:56:44,618 [myid:] - INFO [NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:30211:NIOServerCnxnFactory$AcceptThread@219] - accept thread exitted run method
[junit] 2015-05-25 19:56:44,618 [myid:] - INFO [NIOServerCxnFactory.SelectorThread-1:NIOServerCnxnFactory$SelectorThread@420] - selector thread exitted run method
[junit] 2015-05-25 19:56:44,619 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Leader]
[junit] 2015-05-25 19:56:44,619 [myid:] - INFO [/127.0.0.1:30213:QuorumCnxManager$Listener@654] - Leaving listener
[junit] 2015-05-25 19:56:44,619 [myid:] - WARN [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled):QuorumPeer@1039] - Unexpected exception
[junit] java.lang.InterruptedException
[junit] at java.lang.Object.wait(Native Method)
[junit] at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:559)
[junit] at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1036)
[junit] 2015-05-25 19:56:44,619 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled):Leader@613] - Shutting down
[junit] 2015-05-25 19:56:44,619 [myid:] - WARN [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled):QuorumPeer@1070] - PeerState set to LOOKING
[junit] 2015-05-25 19:56:44,619 [myid:] - WARN [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled):QuorumPeer@1052] - QuorumPeer main thread exited
[junit] 2015-05-25 19:56:44,619 [myid:] - INFO [main:QuorumUtil@254] - Shutting down leader election QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled)
[junit] 2015-05-25 19:56:44,619 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3]
[junit] 2015-05-25 19:56:44,619 [myid:] - INFO [main:QuorumUtil@259] - Waiting for QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled) to exit thread
[junit] 2015-05-25 19:56:44,620 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3]
[junit] 2015-05-25 19:56:44,620 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.1]
[junit] 2015-05-25 19:56:44,620 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:30211)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.2]
[junit] 2015-05-25 19:56:44,620 [myid:] - INFO [main:FourLetterWordMain@63] - connecting to 127.0.0.1 30205
[junit] 2015-05-25 19:56:44,621 [myid:] - INFO [main:QuorumUtil@243] - 127.0.0.1:30205
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558632#comment-14558632 ]

Ziyou Wang commented on ZOOKEEPER-2172:
---------------------------------------

The reason there are so many clients connecting and disconnecting is that we use the zk CLI to do the operations, so the CLI just creates a client, performs the operation, and shuts it down. From our tests, I think this is a race condition and probably unrelated to the CLI. When the cluster runs in a faster environment, we only need to wait a short interval between starting node 3 and reconfiguring it to join the cluster, and we can almost always avoid the problem. But when we move to a slower environment it can be reproduced consistently again, so the temporary workaround is to extend the waiting time.
[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558634#comment-14558634 ]

Yasuhito Fukuda commented on ZOOKEEPER-2193:
--------------------------------------------

Hi [~shralex], [~mitake], thanks for your comments. I have recreated the patch to reflect them; please see patch v2. A loopback address (localhost/127.0.0.1) or wildcard address (0.0.0.0) may legitimately be the same address:port on different servers, I think, so I added code to exclude them from the validation.

reconfig command completes even if parameter is wrong obviously
---------------------------------------------------------------

                 Key: ZOOKEEPER-2193
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2193
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection, server
    Affects Versions: 3.5.0
         Environment: CentOS7 + Java7
            Reporter: Yasuhito Fukuda
         Attachments: ZOOKEEPER-2193-v2.patch, ZOOKEEPER-2193.patch

Even when a reconfig parameter is obviously wrong, the command was confirmed to complete. Refer to the following.

- Ensemble consists of four nodes

{noformat}
[zk: vm-101:2181(CONNECTED) 0] config
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
version=1
{noformat}

- Add a node by the reconfig command

{noformat}
[zk: vm-101:2181(CONNECTED) 9] reconfig -add server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
Committed new configuration:
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
version=30007
{noformat}

The IP addresses of server.4 and server.5 are duplicates. In this state, leader election will not work properly, and the ensemble may end up in an undesirable state. I think reconfig needs parameter validation.
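For readers of the thread, the validation idea described in the comment above can be sketched as follows. All names here (isExcluded, hasDuplicateEndpoint) are hypothetical illustrations, not the actual ZOOKEEPER-2193 patch; the loopback/wildcard exclusion mirrors what the comment describes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of duplicate-endpoint validation for reconfig:
// reject an address:port already owned by a *different* server, but skip
// loopback and wildcard addresses, which may legitimately repeat.
public class ReconfigValidation {

    static boolean isExcluded(String host) {
        return host.equals("0.0.0.0") || host.equals("localhost") || host.equals("127.0.0.1");
    }

    /**
     * existing maps "host:port" -> server id in the proposed configuration.
     * Returns true when hostPort is already owned by a different server.
     */
    static boolean hasDuplicateEndpoint(Map<String, Long> existing, String hostPort, long sid) {
        String host = hostPort.substring(0, hostPort.lastIndexOf(':'));
        if (isExcluded(host)) {
            return false; // wildcard/loopback are allowed to repeat
        }
        Long owner = existing.get(hostPort);
        return owner != null && owner != sid;
    }

    public static void main(String[] args) {
        Map<String, Long> endpoints = new HashMap<>();
        endpoints.put("192.168.100.104:2888", 4L);
        // Adding server.5 on server.4's quorum address should be flagged:
        System.out.println(hasDuplicateEndpoint(endpoints, "192.168.100.104:2888", 5L)); // prints true
    }
}
```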
[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558599#comment-14558599 ]

Alexander Shraer commented on ZOOKEEPER-2193:
---------------------------------------------

Thanks for the patch! Here are some comments: one can use reconfig -add to change the ports of a server dynamically, basically adding a server that is already in the system with different ports. In that case it is possible that all the ports but one will be the same. To do this, nextServers.remove(qs.id); removes the old spec of a server and the following line adds the new spec. To avoid checking a server against its old self, I think that nextServers.remove(qs.id); should go before your checks. Also, why do you need joinedServers? Isn't nextServers, defined above your code, sufficient?

[~michim], can you take a look as well? Do you know whether the same address:port can be shared by different servers? (Please see the comment above by [~Yasuhito Fukuda].)
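The ordering point in the review above (remove the old spec before the duplicate checks) can be illustrated with a small sketch. The helper below is hypothetical and merely stands in for the logic around nextServers.remove(qs.id); it is not the patch itself.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: drop a server's old spec *before* running duplicate checks, so a
// port change of an existing server is not rejected against its own
// previous address.
public class ReconfigOrdering {

    /** nextServers maps server id -> "host:port"; returns true if accepted. */
    static boolean addServer(Map<Long, String> nextServers, long sid, String endpoint) {
        nextServers.remove(sid);                   // remove the old spec first (the review's point)
        if (nextServers.containsValue(endpoint)) { // only *other* servers remain to collide with
            return false;                          // genuine duplicate: reject
        }
        nextServers.put(sid, endpoint);
        return true;
    }

    public static void main(String[] args) {
        Map<Long, String> next = new HashMap<>();
        next.put(1L, "10.0.0.1:3888");
        System.out.println(addServer(next, 1L, "10.0.0.1:3999")); // same server, new port: prints true
        System.out.println(addServer(next, 2L, "10.0.0.1:3999")); // collides with server 1: prints false
    }
}
```

Had the removal come after the checks, the first call (changing server 1's own port) would not be affected, but a check comparing the new spec against a map that still contains server 1's old entry could wrongly flag legitimate updates that keep some ports unchanged.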
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558594#comment-14558594 ]

Alexander Shraer commented on ZOOKEEPER-2172:
---------------------------------------------

Michi, I agree that it's weird, and again I see this 0x round number, which I think causes the other server to ignore the leader's messages and keep looking even though server 1 is leading (without timing out, for some reason). This looks related to ZOOKEEPER-1732 and ZOOKEEPER-1805. [~fpj], could you take a look?

It also seems like a lot of client sessions are being established and destroyed (clients connect and disconnect). In particular, when the reconfig adding server 3 happens (12:33:21,797 on server 1), the client session 0x14d8b08424a0014 (the client that submitted the reconfig) gets closed in the middle of the operation. Then the connection to server 2 is suddenly closed with this error (on server 2):

{noformat}
2015-05-25 12:33:25,786 [myid:2] - WARN [QuorumPeer[myid=2]/10.0.0.2:1300:Follower@92] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
{noformat}

Could it be that the termination of a client session in the middle of an op messes up server-to-server connections?
[jira] [Updated] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yasuhito Fukuda updated ZOOKEEPER-2193:
---------------------------------------
    Attachment: ZOOKEEPER-2193-v2.patch
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558641#comment-14558641 ]

Alexander Shraer commented on ZOOKEEPER-2172:
---------------------------------------------

When you run in the slower environment, does server 2 still disconnect, as in the logs, when you add server 3? Can you please provide your initial configs for all 3 servers, and the reconfig commands you run? I'd like to try to reproduce this.
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558709#comment-14558709 ]

Alexander Shraer commented on ZOOKEEPER-2172:
---------------------------------------------

Thanks. The basic scenario you described works for me, but you're right, of course: there may be a race condition. It would be great if you could help investigate this further. For example, why does server 2 disconnect when you add server 3? BTW, server 3 doesn't even have to be up in order for you to add it to the ensemble, since you have 2 out of 3 servers without it. I also wonder whether the problem has anything to do with the client shutting off right after: does it happen if you don't close the connection on the client?
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558698#comment-14558698 ] Ziyou Wang commented on ZOOKEEPER-2172: --- Yes, it has the same problem. The steps are:
1. Start server 1. Its zoo.cfg contains: standaloneEnabled=false syncLimit=2 tickTime=2000 initLimit=5, and its dynamic config file contains just itself.
2. Start server 2. Its zoo.cfg is the same, and its initial dynamic config file contains server 1 and itself. Then reconfig it as a participant after it syncs data from server 1 (you need to try a few times until the reconfig can finish).
3. Start server 3 with the same zoo.cfg and an initial dynamic config file containing servers 1, 2, and itself. Then reconfig it as the third participant (where we hit the bug).
Since it is a race condition, it may not be easy to reproduce in your environment. In my dev setup, the zk server runs in a docker-in-VM environment. If I change the script to wait just 5 secs after starting server 3, the problem can be reproduced every time.
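For reference, the per-server configuration described in the steps above can be sketched as follows; the zoo.cfg keys are the ones quoted in the comment, while the paths, addresses, and dynamic-file entries are illustrative, not taken from the attached logs:

```
# zoo.cfg -- identical on all three servers, per the steps above
standaloneEnabled=false
syncLimit=2
tickTime=2000
initLimit=5
# illustrative additions: a real deployment also needs at least these
dataDir=/var/lib/zookeeper
dynamicConfigFile=/var/lib/zookeeper/zoo.cfg.dynamic

# zoo.cfg.dynamic as server 3 would see it at startup (addresses illustrative)
server.1=10.0.0.1:2888:3888:participant;2181
server.2=10.0.0.2:2888:3888:participant;2181
server.3=10.0.0.3:2888:3888:participant;2181
```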
[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558642#comment-14558642 ] Alexander Shraer commented on ZOOKEEPER-2193: - Thanks! I'll take a closer look tomorrow, but as I mentioned, the line nextServers.remove(qs.id); should probably go before your changes, not after; otherwise we won't be able to change a server's port.

reconfig command completes even if parameter is wrong obviously
Key: ZOOKEEPER-2193 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2193 Project: ZooKeeper Issue Type: Bug Components: leaderElection, server Affects Versions: 3.5.0 Environment: CentOS7 + Java7 Reporter: Yasuhito Fukuda Attachments: ZOOKEEPER-2193-v2.patch, ZOOKEEPER-2193.patch

Even if a reconfig parameter is obviously wrong, the command was confirmed to complete. Refer to the following.
- The ensemble consists of four nodes:
{noformat}
[zk: vm-101:2181(CONNECTED) 0] config
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
version=1
{noformat}
- Add a node with the reconfig command:
{noformat}
[zk: vm-101:2181(CONNECTED) 9] reconfig -add server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
Committed new configuration:
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
version=30007
{noformat}
The IP addresses of server.4 and server.5 are duplicates. In this state, leader election will not work properly, and the ensemble may end up in an undesirable state. I think parameter validation is needed for reconfig.
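The missing validation could take a shape like the following sketch: before committing a reconfig -add, check that the candidate's quorum address does not collide with an address already in the configuration. The class and method names here are hypothetical, not ZooKeeper's actual API; the spec strings follow the server.N=host:quorumPort:electionPort:role;clientAddr format shown above.

```java
import java.util.*;

// Hypothetical sketch of the validation proposed in ZOOKEEPER-2193:
// reject a reconfig -add whose host:quorumPort duplicates an existing server's.
public class ReconfigValidator {

    /** Extracts "host:quorumPort" from a spec like
     *  "server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181". */
    static String quorumAddress(String spec) {
        String rhs = spec.substring(spec.indexOf('=') + 1);
        String serverPart = rhs.split(";")[0];   // drop the client address, if any
        String[] fields = serverPart.split(":");
        return fields[0] + ":" + fields[1];      // host:quorumPort
    }

    /** Returns true if the candidate's quorum address duplicates one
     *  already present in the current configuration. */
    static boolean hasDuplicateAddress(List<String> current, String candidate) {
        Set<String> seen = new HashSet<>();
        for (String spec : current) {
            seen.add(quorumAddress(spec));
        }
        return !seen.add(quorumAddress(candidate));  // add() is false on duplicate
    }

    public static void main(String[] args) {
        List<String> config = Arrays.asList(
            "server.1=192.168.100.101:2888:3888:participant",
            "server.2=192.168.100.102:2888:3888:participant",
            "server.3=192.168.100.103:2888:3888:participant",
            "server.4=192.168.100.104:2888:3888:participant");

        // server.5 reuses server.4's IP and quorum port: should be rejected
        String bad = "server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181";
        System.out.println(hasDuplicateAddress(config, bad));   // true

        String ok = "server.5=192.168.100.105:2888:3888:participant;0.0.0.0:2181";
        System.out.println(hasDuplicateAddress(config, ok));    // false
    }
}
```

A real fix would live in the leader's reconfig request processing (near the nextServers handling discussed above) and would likely also need to tolerate an intentional port change for an existing server id, which is why the ordering of nextServers.remove(qs.id) matters.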
[jira] [Updated] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Shraer updated ZOOKEEPER-2193: Assignee: Yasuhito Fukuda
Failed: ZOOKEEPER-2193 PreCommit Build #2711
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2193 Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2711/

### ## LAST 60 LINES OF THE CONSOLE ###
[...truncated 370849 lines...]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no new tests are needed for this patch.
[exec] Also please list what manual steps were performed to verify this patch.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec] -1 core tests. The patch failed core unit tests.
[exec] +1 contrib tests. The patch passed contrib unit tests.
[exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2711//testReport/
[exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2711//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
[exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2711//console
[exec] This message is automatically generated.
[exec] Adding comment to Jira.
[exec] Comment added.
[exec] cd47fc2ed4cc3623f810450e59d5b2f10f8bf2e3 logged out
[exec] Finished build.
[exec] BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build.xml:1782: exec returned: 2
Total time: 8 minutes 0 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-ZOOKEEPER-Build #2708
Archived 7 artifacts
Archive block size is 32768
Received 3 blocks and 465417 bytes
Compression is 17.4%
Took 4.2 sec
Recording test results
Description set: ZOOKEEPER-2193
Email was triggered for: Failure
Sending email for trigger: Failure

### ## FAILED TESTS (if any) ##
1 tests failed.
REGRESSION: org.apache.zookeeper.server.quorum.LearnerSnapshotThrottlerTest.testHighContentionWithTimeout
Error Message: null
Stack Trace:
junit.framework.AssertionFailedError
	at org.apache.zookeeper.server.quorum.LearnerSnapshotThrottlerTest.testHighContentionWithTimeout(LearnerSnapshotThrottlerTest.java:211)
	at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:55)