[jira] [Updated] (ZOOKEEPER-2197) non-ascii character in FinalRequestProcessor.java
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michi Mutsuzaki updated ZOOKEEPER-2197:
Attachment: ZOOKEEPER-2197.patch

non-ascii character in FinalRequestProcessor.java

Key: ZOOKEEPER-2197
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2197
Project: ZooKeeper
Issue Type: Bug
Reporter: Michi Mutsuzaki
Assignee: Michi Mutsuzaki
Priority: Minor
Fix For: 3.5.1, 3.6.0
Attachments: ZOOKEEPER-2197.patch

src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java:134: error: unmappable character for encoding ASCII
[javac] // was not being queued ??? ZOOKEEPER-558) properly. This happens, for example,

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2163) Introduce new ZNode type: container
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566149#comment-14566149 ] Hadoop QA commented on ZOOKEEPER-2163:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12736363/zookeeper-2163.11.patch
against trunk revision 1682623.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2724//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2724//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2724//console

This message is automatically generated.

Introduce new ZNode type: container

Key: ZOOKEEPER-2163
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2163
Project: ZooKeeper
Issue Type: New Feature
Components: c client, java client, server
Affects Versions: 3.5.0
Reporter: Jordan Zimmerman
Assignee: Jordan Zimmerman
Fix For: 3.6.0
Attachments: zookeeper-2163.10.patch, zookeeper-2163.11.patch, zookeeper-2163.3.patch, zookeeper-2163.5.patch, zookeeper-2163.6.patch, zookeeper-2163.7.patch, zookeeper-2163.8.patch, zookeeper-2163.9.patch

BACKGROUND

A recurring problem for ZooKeeper users is garbage collection of parent nodes. Many recipes (e.g. locks, leaders, etc.)
call for the creation of a parent node under which participants create sequential nodes. When the participant is done, it deletes its node. In practice, the ZooKeeper tree begins to fill up with orphaned parent nodes that are no longer needed. The ZooKeeper APIs don't provide a way to clean these up. Over time, ZooKeeper can become unstable due to the number of these nodes.

CURRENT SOLUTIONS

Apache Curator has a workaround for this in its Reaper class, which runs in the background looking for orphaned parent nodes and deleting them. This isn't ideal; it would be better if ZooKeeper supported this directly.

PROPOSAL

ZOOKEEPER-723 and ZOOKEEPER-834 proposed allowing EPHEMERAL nodes to contain child nodes. This is not optimal, as EPHEMERALs are tied to a session while the general use case for parent nodes calls for PERSISTENT nodes. This proposal adds a new node type, CONTAINER. A CONTAINER node is the same as a PERSISTENT node, with the additional property that it is deleted when its last child is deleted (and empty CONTAINER nodes recursively up the tree are deleted as well).

CANONICAL USAGE

{code}
while (true) { // or some reasonable limit
    try {
        zk.create(path, ...);
        break;
    } catch (KeeperException.NoNodeException e) {
        try {
            zk.createContainer(containerPath, ...);
        } catch (KeeperException.NodeExistsException ignore) {
        }
    }
}
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
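The retry loop in the canonical usage guards against a race: the container parent can be garbage-collected (or not yet exist) between a client's attempts, so the child create is retried after (re)creating the container. Below is a self-contained sketch of that pattern. Note the `Zk` class here is a toy in-memory stand-in for the real ZooKeeper client (the real API's data, ACL, and create-mode arguments are elided in the proposal too), and `createContainer` is the method name used in this proposal, which may differ in the final API.

```java
import java.util.HashSet;
import java.util.Set;

public class ContainerUsage {
    static class NoNodeException extends Exception {}
    static class NodeExistsException extends Exception {}

    // Toy stand-in for the ZooKeeper client: create() fails with
    // NoNodeException while the parent node is missing, mirroring
    // the behavior the retry loop in the proposal is written for.
    static class Zk {
        final Set<String> nodes = new HashSet<>();

        void create(String path) throws NoNodeException {
            String parent = path.substring(0, path.lastIndexOf('/'));
            if (!parent.isEmpty() && !nodes.contains(parent)) {
                throw new NoNodeException();
            }
            nodes.add(path);
        }

        void createContainer(String path) throws NodeExistsException {
            if (!nodes.add(path)) {
                throw new NodeExistsException();
            }
        }
    }

    // The canonical pattern: try the child create, and on NoNodeException
    // create the container parent and retry.
    static void createUnderContainer(Zk zk, String containerPath, String path) {
        while (true) { // or some reasonable limit
            try {
                zk.create(path);
                break;
            } catch (NoNodeException e) {
                try {
                    zk.createContainer(containerPath);
                } catch (NodeExistsException ignore) {
                    // another client created the container first; just retry
                }
            }
        }
    }

    public static void main(String[] args) {
        Zk zk = new Zk();
        createUnderContainer(zk, "/locks", "/locks/lock-0001");
        System.out.println(zk.nodes.contains("/locks/lock-0001")); // prints: true
    }
}
```

The loop is idempotent from the client's point of view: whichever client wins the race to create the container, every participant eventually succeeds in creating its child node.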
Success: ZOOKEEPER-2163 PreCommit Build #2724
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2163
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2724/

### LAST 60 LINES OF THE CONSOLE ###
[...truncated 371198 lines...]
[exec] +1 overall. Here are the results of testing the latest attachment
[exec] http://issues.apache.org/jira/secure/attachment/12736363/zookeeper-2163.11.patch
[exec] against trunk revision 1682623.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 5 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec] +1 core tests. The patch passed core unit tests.
[exec] +1 contrib tests. The patch passed contrib unit tests.
[exec]
[exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2724//testReport/
[exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2724//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
[exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2724//console
[exec]
[exec] This message is automatically generated.
[exec]
[exec] Adding comment to Jira.
[exec] Comment added.
[exec] 6d7ac2dcefac26fcc81ef3e8bcb0b204b6ab875f logged out
[exec] Finished build.

BUILD SUCCESSFUL
Total time: 13 minutes 30 seconds
Archiving artifacts
Sending artifact delta relative to PreCommit-ZOOKEEPER-Build #2723
Archived 24 artifacts
Archive block size is 32768
Received 6 blocks and 33460976 bytes
Compression is 0.6%
Took 11 sec
Recording test results
Description set: ZOOKEEPER-2163
Email was triggered for: Success
Sending email for trigger: Success

### FAILED TESTS (if any) ###
All tests passed
[jira] [Updated] (ZOOKEEPER-2163) Introduce new ZNode type: container
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jordan Zimmerman updated ZOOKEEPER-2163:
Attachment: zookeeper-2163.11.patch

Last minute nits/reformats

Introduce new ZNode type: container

Key: ZOOKEEPER-2163
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2163
Project: ZooKeeper
Issue Type: New Feature
Components: c client, java client, server
Affects Versions: 3.5.0
Reporter: Jordan Zimmerman
Assignee: Jordan Zimmerman
Fix For: 3.6.0
Attachments: zookeeper-2163.10.patch, zookeeper-2163.11.patch, zookeeper-2163.3.patch, zookeeper-2163.5.patch, zookeeper-2163.6.patch, zookeeper-2163.7.patch, zookeeper-2163.8.patch, zookeeper-2163.9.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
ZooKeeper-trunk-openjdk7 - Build # 826 - Still Failing
See https://builds.apache.org/job/ZooKeeper-trunk-openjdk7/826/

### LAST 60 LINES OF THE CONSOLE ###
[...truncated 369155 lines...]
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:877)
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:598)
[junit] 2015-05-30 19:56:48,046 [myid:] - WARN [LearnerHandler-/127.0.0.1:41797:LearnerHandler@879] - Ignoring unexpected exception
[junit] java.lang.InterruptedException
[junit] at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
[junit] at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
[junit] at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:877)
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:598)
[junit] 2015-05-30 19:56:48,045 [myid:] - WARN [LearnerHandler-/127.0.0.1:41810:LearnerHandler@879] - Ignoring unexpected exception
[junit] java.lang.InterruptedException
[junit] at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
[junit] at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
[junit] at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:877)
[junit] at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:598)
[junit] 2015-05-30 19:56:48,047 [myid:] - INFO [NIOServerCxnFactory.SelectorThread-0:NIOServerCnxnFactory$SelectorThread@420] - selector thread exitted run method
[junit] 2015-05-30 19:56:48,046 [myid:] - INFO [NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:14053:NIOServerCnxnFactory$AcceptThread@219] - accept thread exitted run method
[junit] 2015-05-30 19:56:48,046 [myid:] - INFO [NIOServerCxnFactory.SelectorThread-1:NIOServerCnxnFactory$SelectorThread@420] - selector thread exitted run method
[junit] 2015-05-30 19:56:48,048 [myid:] - INFO [QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id5,name1=replica.5,name2=Leader]
[junit] 2015-05-30 19:56:48,048 [myid:] - INFO [/127.0.0.1:14055:QuorumCnxManager$Listener@659] - Leaving listener
[junit] 2015-05-30 19:56:48,048 [myid:] - WARN [QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled):QuorumPeer@1039] - Unexpected exception
[junit] java.lang.InterruptedException
[junit] at java.lang.Object.wait(Native Method)
[junit] at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:559)
[junit] at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1036)
[junit] 2015-05-30 19:56:48,048 [myid:] - INFO [QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled):Leader@613] - Shutting down
[junit] 2015-05-30 19:56:48,048 [myid:] - WARN [QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled):QuorumPeer@1070] - PeerState set to LOOKING
[junit] 2015-05-30 19:56:48,048 [myid:] - WARN [QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled):QuorumPeer@1052] - QuorumPeer main thread exited
[junit] 2015-05-30 19:56:48,048 [myid:] - INFO [main:QuorumUtil@254] - Shutting down leader election QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled)
[junit] 2015-05-30 19:56:48,049 [myid:] - INFO [QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id5]
[junit] 2015-05-30 19:56:48,049 [myid:] - INFO [main:QuorumUtil@259] - Waiting for QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled) to exit thread
[junit] 2015-05-30 19:56:48,049 [myid:] - INFO [QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id5,name1=replica.5]
[junit] 2015-05-30 19:56:48,049 [myid:] - INFO [QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id5,name1=replica.1]
[junit] 2015-05-30 19:56:48,049 [myid:] - INFO [QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:14053)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id5,name1=replica.2]
[junit] 2015-05-30
Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1
Ok, since the vote didn't pass anyway, let's fix these problems:

1. Change the default test.junit.threads to 1. Chris, could you submit a patch for this?
2. Fix the comment in FinalRequestProcessor.java. I'll submit a patch.

Let me know if you guys have seen any other problems. Also, please let me know if the voting period of 2 weeks was too short. I'd like to make sure everybody gets enough time to vote.

On Sat, May 30, 2015 at 8:55 AM, Flavio Junqueira fpjunque...@yahoo.com.invalid wrote:

Another thing that is possibly not a reason to drop the config, but I'm getting this with this RC:

[javac] /home/fpj/code/zookeeper-3.5.1-alpha/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java:134: error: unmappable character for encoding ASCII
[javac] // was not being queued ??? ZOOKEEPER-558) properly. This happens, for example,

It is a trivial problem to solve, but it does generate a compilation error for me. -Flavio

On 30 May 2015, at 15:26, Flavio Junqueira fpjunque...@yahoo.com.INVALID wrote:

I don't see a reason to -1 the release just because of the number of threads junit is using. I've been a bit distracted with other things, but I'm coming back to the release candidate now. -Flavio

On 23 May 2015, at 22:09, Michi Mutsuzaki mutsuz...@gmail.com wrote:

I can go either way. Flavio, do you think we should set the default test.junit.threads to 1 and create another release candidate?

On Fri, May 22, 2015 at 5:08 PM, Chris Nauroth cnaur...@hortonworks.com wrote:

I haven't been able to repro this locally.
Here are the details on my Ubuntu VM:

uname -a
Linux ubuntu 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

java -version
java version 1.8.0_45
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

ant -version
Apache Ant(TM) version 1.9.4 compiled on April 29 2014

I'm getting 100% passing test runs with multiple concurrent JUnit processes, including the tests that you mentioned were failing in your environment. I don't have any immediate ideas for what to try next. Everything has been working well on Jenkins and multiple dev machines, so it seems like there is some subtle environmental difference in this VM that I didn't handle in the ZOOKEEPER-2183 patch.

Is this problematic for the release candidate? If so, then I recommend doing a quick change to set the default test.junit.threads to 1 in build.xml. That would restore the old single-process testing behavior. We can change test-patch.sh to pass -Dtest.junit.threads=8 on the command line, so we'll still get speedy pre-commit runs on Jenkins where it is working well. We all can do the same when we run ant locally too. Let me know if this is important, and I can put together a patch quickly. Thanks!

--Chris Nauroth

From: Flavio Junqueira fpjunque...@yahoo.com
Date: Friday, May 22, 2015 at 3:37 PM
To: Chris Nauroth cnaur...@hortonworks.com
Cc: Zookeeper dev@zookeeper.apache.org
Subject: Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1

That's the range I get in the vm. I also checked the load from log test and the port it was trying to bind to is 11222. -Flavio

On 22 May 2015, at 23:14, Chris Nauroth cnaur...@hortonworks.com wrote:

No worries on the delay. Thank you for sharing. That's interesting.
The symptoms look similar to something we had seen from an earlier iteration of the ZOOKEEPER-2183 patch that was assigning ports from the ephemeral port range. This would cause a brief (but noticeable) window in which the OS could assign the same ephemeral port to a client socket while a server test still held onto that port assignment. It was particularly noticeable for tests that stop and restart a server on the same port, such as tests covering client reconnect logic. In the final committed version of the ZOOKEEPER-2183 patch, I excluded the ephemeral port range from use by port assignment. Typically, that's 32768 - 61000 on Linux. Is it possible that this VM is configured to use a different ephemeral port range? Here is what I get from recent stock Ubuntu and CentOS installs:

cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000

--Chris Nauroth

From: Flavio Junqueira fpjunque...@yahoo.com
Date: Friday, May 22, 2015 at 2:47 PM
To: Chris Nauroth cnaur...@hortonworks.com
Cc: Zookeeper dev@zookeeper.apache.org
Subject: Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1

Sorry about the delay, here are the logs: http://people.apache.org/~fpj/logs-3.5.1-rc1/ The load test is giving bind exceptions. -Flavio

On 21 May 2015, at
[jira] [Created] (ZOOKEEPER-2197) non-ascii character in FinalRequestProcessor.java
Michi Mutsuzaki created ZOOKEEPER-2197:

Summary: non-ascii character in FinalRequestProcessor.java
Key: ZOOKEEPER-2197
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2197
Project: ZooKeeper
Issue Type: Bug
Reporter: Michi Mutsuzaki
Assignee: Michi Mutsuzaki
Priority: Minor
Fix For: 3.5.1, 3.6.0

src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java:134: error: unmappable character for encoding ASCII
[javac] // was not being queued ??? ZOOKEEPER-558) properly. This happens, for example,

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: ZOOKEEPER-2197 PreCommit Build #2725
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2197
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2725/

### LAST 60 LINES OF THE CONSOLE ###
[...truncated 374520 lines...]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no new tests are needed for this patch.
[exec] Also please list what manual steps were performed to verify this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec] +1 core tests. The patch passed core unit tests.
[exec] +1 contrib tests. The patch passed contrib unit tests.
[exec]
[exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2725//testReport/
[exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2725//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
[exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2725//console
[exec]
[exec] This message is automatically generated.
[exec]
[exec] Adding comment to Jira.
[exec] Comment added.
[exec] 6d2b4da58ce2891a2db5450a4c33242e44b2b170 logged out
[exec] Finished build.

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build.xml:1782: exec returned: 1
Total time: 13 minutes 19 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-ZOOKEEPER-Build #2724
Archived 24 artifacts
Archive block size is 32768
Received 4 blocks and 33817476 bytes
Compression is 0.4%
Took 12 sec
Recording test results
Description set: ZOOKEEPER-2197
Email was triggered for: Failure
Sending email for trigger: Failure

### FAILED TESTS (if any) ###
All tests passed
[jira] [Commented] (ZOOKEEPER-2197) non-ascii character in FinalRequestProcessor.java
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566189#comment-14566189 ] Hadoop QA commented on ZOOKEEPER-2197:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12736371/ZOOKEEPER-2197.patch
against trunk revision 1682623.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2725//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2725//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2725//console

This message is automatically generated.

non-ascii character in FinalRequestProcessor.java

Key: ZOOKEEPER-2197
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2197
Project: ZooKeeper
Issue Type: Bug
Reporter: Michi Mutsuzaki
Assignee: Michi Mutsuzaki
Priority: Minor
Fix For: 3.5.1, 3.6.0
Attachments: ZOOKEEPER-2197.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2189) multiple leaders can be elected when configs conflict
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566120#comment-14566120 ] Hongchao Deng commented on ZOOKEEPER-2189:

Hi [~suda]. I committed ZOOKEEPER-2098 but mistakenly wrote the commit message as ZOOKEEPER-2189. Would you mind opening another JIRA and granting this JIRA number to me? Thanks!

multiple leaders can be elected when configs conflict

Key: ZOOKEEPER-2189
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2189
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection
Affects Versions: 3.5.0
Reporter: Akihiro Suda

This sequence leads the ensemble to a split-brain state:

* Start server 1 (config=1:participant, 2:participant, 3:participant)
* Start server 2 (config=1:participant, 2:participant, 3:participant)
* 1 and 2 believe 2 is the leader
* Start server 3 (config=1:observer, 2:observer, 3:participant)
* 3 believes 3 is the leader, although 1 and 2 still believe 2 is the leader

Such a split-brain ensemble is very unstable. Znodes can be lost easily:

* Create some znodes on 2
* Restart 1 and 2
* 1, 2 and 3 can think 3 is the leader
* znodes created on 2 are lost, as 1 and 2 sync with 3

I consider this behavior a bug: ZK should fail gracefully if a participant is listed as an observer in the config. In the current implementation, ZK cannot detect such an invalid config, as FastLeaderElection.sendNotification() sends notifications only to voting members and hence there is no message from the observers (1 and 2) to the new voter (3). I think FastLeaderElection.sendNotification() should send notifications to all the members and FastLeaderElection.Messenger.WorkerReceiver.run() should verify acks. Any thoughts?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2189) multiple leaders can be elected when configs conflict
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566164#comment-14566164 ] Alexander Shraer commented on ZOOKEEPER-2189:

Verifying acks as proposed in this JIRA will not solve this issue. Acks from observers are not required to elect a leader. If standaloneEnabled=false, server 3 can be elected without seeing any other messages. Also, suppose you wrote the wrong ports for the other servers? It seems that to fix such errors one needs some kind of config registry.

multiple leaders can be elected when configs conflict

Key: ZOOKEEPER-2189
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2189
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection
Affects Versions: 3.5.0
Reporter: Akihiro Suda

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (ZOOKEEPER-2198) Set default test.junit.threads to 1.
Chris Nauroth created ZOOKEEPER-2198:

Summary: Set default test.junit.threads to 1.
Key: ZOOKEEPER-2198
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2198
Project: ZooKeeper
Issue Type: Bug
Components: build
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Minor
Fix For: 3.5.1, 3.6.0

Some systems are seeing test failures under concurrent execution. This issue proposes to change the default {{test.junit.threads}} to 1 so that those environments continue to get consistent test runs. Jenkins and individual developer environments can set multiple threads with a command line argument, so most environments will still get the benefit of faster test runs.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
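For reference, the change described above would amount to an Ant property default along these lines. This is only a sketch: the property name test.junit.threads is taken from the thread, but where exactly the property lives in ZooKeeper's build.xml is an assumption.

```xml
<!-- Default to a single JUnit process for consistent runs; individual
     environments can override with: ant -Dtest.junit.threads=8 -->
<property name="test.junit.threads" value="1"/>
```

Because Ant properties are immutable once set, a -D value passed on the command line takes precedence over the default declared in build.xml, which is what makes the per-environment override work.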
[jira] [Commented] (ZOOKEEPER-2197) non-ascii character in FinalRequestProcessor.java
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566313#comment-14566313 ] Michi Mutsuzaki commented on ZOOKEEPER-2197:

That sounds fine. I guess we can set the encoding in build.xml?

non-ascii character in FinalRequestProcessor.java

Key: ZOOKEEPER-2197
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2197
Project: ZooKeeper
Issue Type: Bug
Reporter: Michi Mutsuzaki
Assignee: Michi Mutsuzaki
Priority: Minor
Fix For: 3.5.1, 3.6.0
Attachments: ZOOKEEPER-2197.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
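Setting the encoding in build.xml would look roughly like the following. The encoding attribute is the standard Ant javac attribute; the srcdir and destdir values shown here are placeholders, not ZooKeeper's actual property names.

```xml
<!-- Declare the source encoding explicitly so javac does not fall back to
     the platform default (e.g. ASCII), which is what produced the
     "unmappable character for encoding ASCII" error in this issue. -->
<javac srcdir="${src.dir}" destdir="${build.classes}" encoding="UTF-8"/>
```

With the encoding pinned, the build no longer depends on the LANG/LC_ALL settings of whichever machine runs the compile.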
[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566302#comment-14566302 ] Michi Mutsuzaki commented on ZOOKEEPER-2193:

Thank you for the patch.

reconfig command completes even if parameter is wrong obviously

Key: ZOOKEEPER-2193
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2193
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection, server
Affects Versions: 3.5.0
Environment: CentOS7 + Java7
Reporter: Yasuhito Fukuda
Assignee: Yasuhito Fukuda
Attachments: ZOOKEEPER-2193-v2.patch, ZOOKEEPER-2193-v3.patch, ZOOKEEPER-2193.patch

It was confirmed that the reconfig command completes even if a parameter is obviously wrong. Refer to the following.

- Ensemble consists of four nodes

{noformat}
[zk: vm-101:2181(CONNECTED) 0] config
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
version=1
{noformat}

- Add a node with the reconfig command

{noformat}
[zk: vm-101:2181(CONNECTED) 9] reconfig -add server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
Committed new configuration:
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
version=30007
{noformat}

The IP addresses of server.4 and server.5 are duplicates. In this state, leader election will not work properly, and the ensemble can end up in an undesirable state. I think parameter validation is needed during reconfig.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michi Mutsuzaki updated ZOOKEEPER-2193:
Comment: was deleted (was: Thank you for the patch)

reconfig command completes even if parameter is wrong obviously

Key: ZOOKEEPER-2193
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2193
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection, server
Affects Versions: 3.5.0
Environment: CentOS7 + Java7
Reporter: Yasuhito Fukuda
Assignee: Yasuhito Fukuda
Attachments: ZOOKEEPER-2193-v2.patch, ZOOKEEPER-2193-v3.patch, ZOOKEEPER-2193.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566329#comment-14566329 ] Alexander Shraer commented on ZOOKEEPER-2193: - 3 more minor comments: 1) I'm not sure that my == null can ever happen, both because of the checks in the calling function and because the exclude... function also excludes null. 2) Perhaps rename existing to something else, since it covers not only the existing servers but also the joiners that were processed before, for example if someone is adding multiple servers with the same command. Similarly, the message in the thrown exception shouldn't say that the conflict is with one of the existing servers, because it may be with one of the new ones. 3) Consider making the message in the exception more specific, such as: port x of server #y conflicts with port x of server #z. reconfig command completes even if parameter is wrong obviously --- Key: ZOOKEEPER-2193 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2193 Project: ZooKeeper Issue Type: Bug Components: leaderElection, server Affects Versions: 3.5.0 Environment: CentOS7 + Java7 Reporter: Yasuhito Fukuda Assignee: Yasuhito Fukuda Attachments: ZOOKEEPER-2193-v2.patch, ZOOKEEPER-2193-v3.patch, ZOOKEEPER-2193.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
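The duplicate-endpoint validation discussed in this review can be illustrated with a small, self-contained sketch. Everything below is hypothetical (ReconfigValidator and checkDuplicates are invented names, not ZooKeeper's reconfig code); it only shows the shape of a check that compares each server's quorum and election endpoints against the ones already seen and reports both conflicting server ids, as comment 3 suggests.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the validation discussed above. Only the quorum
// (2888) and election (3888) ports of each "host:port:port:role" spec are
// checked; the error message names both conflicting servers.
public class ReconfigValidator {
    // Returns an error message naming both conflicting servers, or null if ok.
    public static String checkDuplicates(Map<Integer, String> servers) {
        Map<String, Integer> seen = new HashMap<>(); // "host:port" -> server id
        for (Map.Entry<Integer, String> e : servers.entrySet()) {
            String[] parts = e.getValue().split(":");
            String host = parts[0];
            for (int i = 1; i <= 2 && i < parts.length; i++) {
                String endpoint = host + ":" + parts[i];
                Integer prev = seen.put(endpoint, e.getKey());
                if (prev != null && !prev.equals(e.getKey())) {
                    return "port " + parts[i] + " of server #" + e.getKey()
                            + " conflicts with server #" + prev;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<Integer, String> cfg = new LinkedHashMap<>();
        cfg.put(4, "192.168.100.104:2888:3888:participant");
        cfg.put(5, "192.168.100.104:2888:3888:participant");
        System.out.println(checkDuplicates(cfg)); // reports the server.4/server.5 clash
    }
}
```

Run against the configuration from the bug report, this flags the reused 192.168.100.104:2888 endpoint before the new config is committed.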
[jira] [Commented] (ZOOKEEPER-2197) non-ascii character in FinalRequestProcessor.java
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566305#comment-14566305 ] Raul Gutierrez Segales commented on ZOOKEEPER-2197: --- [~michim], [~fpj]: hmm, how about using -Dfile.encoding=utf8? non-ascii character in FinalRequestProcessor.java - Key: ZOOKEEPER-2197 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2197 Project: ZooKeeper Issue Type: Bug Reporter: Michi Mutsuzaki Assignee: Michi Mutsuzaki Priority: Minor Fix For: 3.5.1, 3.6.0 Attachments: ZOOKEEPER-2197.patch src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java:134: error: unmappable character for encoding ASCII [javac] // was not being queued ??? ZOOKEEPER-558) properly. This happens, for example, -- This message was sent by Atlassian JIRA (v6.3.4#6332)
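One way to locate the offending character without touching the build is to scan the file for its first non-ASCII byte. This is a standalone illustration, not part of the ZooKeeper source; the class name and command-line interface are invented.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Standalone illustration: report the first non-ASCII byte in a file, the
// kind of character that makes javac fail with "unmappable character for
// encoding ASCII" when the build's encoding is not UTF-8.
public class NonAsciiScanner {
    // Returns the 1-based offset of the first non-ASCII byte, or -1 if clean.
    public static long firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) > 0x7F) {
                return i + 1L;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        long pos = firstNonAscii(data);
        System.out.println(pos < 0 ? "ASCII-clean" : "non-ASCII byte at offset " + pos);
    }
}
```

Pointed at FinalRequestProcessor.java, this would pinpoint the byte javac rejects; fixing the character in place or setting the build's encoding, as suggested above, are the two ways out.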
[jira] [Updated] (ZOOKEEPER-2198) Set default test.junit.threads to 1.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated ZOOKEEPER-2198: - Summary: Set default test.junit.threads to 1. (was: Set default test,junit.threads to 1.) Set default test.junit.threads to 1. Key: ZOOKEEPER-2198 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2198 Project: ZooKeeper Issue Type: Bug Components: build Reporter: Chris Nauroth Assignee: Chris Nauroth Priority: Minor Fix For: 3.5.1, 3.6.0 Attachments: ZOOKEEPER-2198.001.patch Some systems are seeing test failures under concurrent execution. This issue proposes to change the default {{test.junit.threads}} to 1 so that those environments continue to get consistent test runs. Jenkins and individual developer environments can set multiple threads with a command line argument, so most environments will still get the benefit of faster test runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (ZOOKEEPER-2198) Set default test,junit.threads to 1.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated ZOOKEEPER-2198: - Attachment: ZOOKEEPER-2198.001.patch Set default test,junit.threads to 1. Key: ZOOKEEPER-2198 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2198 Project: ZooKeeper Issue Type: Bug Components: build Reporter: Chris Nauroth Assignee: Chris Nauroth Priority: Minor Fix For: 3.5.1, 3.6.0 Attachments: ZOOKEEPER-2198.001.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1
Thank you, Michi. I filed a patch for this on ZOOKEEPER-2198. --Chris Nauroth On 5/30/15, 1:19 PM, Michi Mutsuzaki mutsuz...@gmail.com wrote: Ok, since the vote didn't pass anyways, let's fix these problems: 1. Change the default test.junit.thread to 1. Chris, could you submit a patch for this? 2. Fix the comment in FinalRequestProcessor.java. I'll submit a patch. Let me know if you guys have seen any other problems. Also, please let me know if the voting period of 2 weeks was too short. I'd like to make sure everybody gets enough time to vote. On Sat, May 30, 2015 at 8:55 AM, Flavio Junqueira fpjunque...@yahoo.com.invalid wrote: Another thing that is possibly not a reason to drop the config, but I'm getting this with this RC: [javac] /home/fpj/code/zookeeper-3.5.1-alpha/src/java/main/org/apache/zookeeper/s erver/FinalRequestProcessor.java:134: error: unmappable character for encoding ASCII [javac] // was not being queued ??? ZOOKEEPER-558) properly. This happens, for example, It is a trivial problem to solve, but it does generate a compilation error for me. -Flavio On 30 May 2015, at 15:26, Flavio Junqueira fpjunque...@yahoo.com.INVALID wrote: I don't see a reason to -1 the release just because of the number of threads junit is using. I've been a bit distracted with other things, but I'm coming back to the release candidate now. -Flavio On 23 May 2015, at 22:09, Michi Mutsuzaki mutsuz...@gmail.com wrote: I can go either way. Flavio, do you think we should set the default test.junit.threads to 1 and create another release candidate? On Fri, May 22, 2015 at 5:08 PM, Chris Nauroth cnaur...@hortonworks.com wrote: I haven't been able to repro this locally. 
Here are the details on my Ubuntu VM: uname -a Linux ubuntu 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux java -version java version 1.8.0_45 Java(TM) SE Runtime Environment (build 1.8.0_45-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode) ant -version Apache Ant(TM) version 1.9.4 compiled on April 29 2014 I'm getting 100% passing test runs with multiple concurrent JUnit processes, including the tests that you mentioned were failing in your environment. I don't have any immediate ideas for what to try next. Everything has been working well on Jenkins and multiple dev machines, so it seems like there is some subtle environmental difference in this VM that I didn't handle in the ZOOKEEPER-2183 patch. Is this problematic for the release candidate? If so, then I recommend doing a quick change to set the default test.junit.threads to 1 in build.xml. That would restore the old single-process testing behavior. We can change test-patch.sh to pass -Dtest.junit.threads=8 on the command line, so we'll still get speedy pre-commit runs on Jenkins where it is working well. We all can do the same when we run ant locally too. Let me know if this is important, and I can put together a patch quickly. Thanks! --Chris Nauroth From: Flavio Junqueira fpjunque...@yahoo.com Date: Friday, May 22, 2015 at 3:37 PM To: Chris Nauroth cnaur...@hortonworks.com Cc: Zookeeper dev@zookeeper.apache.org Subject: Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1 That's the range I get in the vm. I also checked the load from log test and the port it was trying to bind to is 11222. -Flavio On 22 May 2015, at 23:14, Chris Nauroth cnaur...@hortonworks.com wrote: No worries on the delay. Thank you for sharing. That's interesting.
The symptoms look similar to something we had seen from an earlier iteration of the ZOOKEEPER-2183 patch that was assigning ports from the ephemeral port range. This would cause a brief (but noticeable) window in which the OS could assign the same ephemeral port to a client socket while a server test still held onto that port assignment. It was particularly noticeable for tests that stop and restart a server on the same port, such as tests covering client reconnect logic. In the final committed version of the ZOOKEEPER-2183 patch, I excluded the ephemeral port range from use by port assignment. Typically, that's 32768 - 61000 on Linux. Is it possible that this VM is configured to use a different ephemeral port range? Here is what I get from recent stock Ubuntu and CentOS installs: cat /proc/sys/net/ipv4/ip_local_port_range 32768 61000 --Chris Nauroth From: Flavio Junqueira fpjunque...@yahoo.com Date: Friday, May 22, 2015 at 2:47 PM To: Chris Nauroth cnaur...@hortonworks.com Cc: Zookeeper dev@zookeeper.apache.org Subject: Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1 Sorry about the delay, here are the logs: http://people.apache.org/~fpj/logs-3.5.1-rc1/
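The ephemeral-range check described above can be sketched in a few lines. This is an illustration of the idea only, not the ZOOKEEPER-2183 code; reading /proc is Linux-specific, so the sketch falls back to the common 32768-61000 default elsewhere.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustration of the idea above: read the kernel's ephemeral port range and
// check whether a candidate test port falls inside it. Names are invented.
public class EphemeralRange {
    public static int[] localPortRange() {
        try {
            String s = new String(Files.readAllBytes(
                    Paths.get("/proc/sys/net/ipv4/ip_local_port_range"))).trim();
            String[] parts = s.split("\\s+");
            return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
        } catch (IOException | RuntimeException e) {
            return new int[] { 32768, 61000 }; // typical Linux default
        }
    }

    public static boolean isEphemeral(int port, int[] range) {
        return port >= range[0] && port <= range[1];
    }

    public static void main(String[] args) {
        int[] r = localPortRange();
        System.out.println("ephemeral range: " + r[0] + "-" + r[1]);
        System.out.println("port 11222 ephemeral? " + isEphemeral(11222, r));
    }
}
```

On a stock install, port 11222 (the one the failing load test bound to) lies outside the default range, which is consistent with the VM in question having an unusual configuration or another cause entirely.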
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566340#comment-14566340 ] Michi Mutsuzaki commented on ZOOKEEPER-2172: I'm guessing node1 is hitting this case? https://github.com/apache/zookeeper/blob/76bb6747c8250f28157636cf4011b78e7569727a/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L332 In this case we don't log the message that gets sent out. Cluster crashes when reconfig a new node as a participant - Key: ZOOKEEPER-2172 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172 Project: ZooKeeper Issue Type: Bug Components: leaderElection, quorum, server Affects Versions: 3.5.0 Environment: Ubuntu 12.04 + java 7 Reporter: Ziyou Wang Priority: Critical Attachments: node-1.log, node-2.log, node-3.log, zoo.cfg.dynamic.1005d, zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-2.log, zookeeper-3.log The operations are quite simple: start three zk servers one by one, then reconfig the cluster to add the new one as a participant. When I add the third one, the zk cluster may enter a weird state and cannot recover. I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log. So the first node received the reconfig cmd at 12:53:48. Later, it logged “2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - *** GOODBYE /10.0.0.2:55890 ”. From then on, the first and second nodes rejected all client connections and the third node didn’t join the cluster as a participant. The whole cluster was down. When the problem happened, all three nodes used the same dynamic config file zoo.cfg.dynamic.1005d, which only contained the first two nodes.
But there was another unused dynamic config file in node-1 directory zoo.cfg.dynamic.next which already contained three nodes. When I extended the waiting time between starting the third node and reconfiguring the cluster, the problem didn’t show again. So it should be a race condition problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565874#comment-14565874 ] Alexander Shraer commented on ZOOKEEPER-2172: - Can you post the logs from the run you mention where the client doesn't disconnect? Cluster crashes when reconfig a new node as a participant - Key: ZOOKEEPER-2172 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172 Project: ZooKeeper Issue Type: Bug Components: leaderElection, quorum, server Affects Versions: 3.5.0 Environment: Ubuntu 12.04 + java 7 Reporter: Ziyou Wang Priority: Critical Attachments: node-1.log, node-2.log, node-3.log, zoo.cfg.dynamic.1005d, zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-2.log, zookeeper-3.log -- This message was sent by Atlassian JIRA (v6.3.4#6332)
ZooKeeper-trunk - Build # 2707 - Failure
See https://builds.apache.org/job/ZooKeeper-trunk/2707/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 372051 lines...] [junit] 2015-05-30 10:40:16,900 [myid:] - WARN [LearnerHandler-/127.0.0.1:58393:LearnerHandler@879] - Ignoring unexpected exception [junit] java.lang.InterruptedException [junit] at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) [junit] at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) [junit] at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) [junit] at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:877) [junit] at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:598) [junit] 2015-05-30 10:40:16,902 [myid:] - INFO [NIOServerCxnFactory.SelectorThread-1:NIOServerCnxnFactory$SelectorThread@420] - selector thread exitted run method [junit] 2015-05-30 10:40:16,902 [myid:] - INFO [NIOServerCxnFactory.SelectorThread-0:NIOServerCnxnFactory$SelectorThread@420] - selector thread exitted run method [junit] 2015-05-30 10:40:16,903 [myid:] - INFO [NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:11358:NIOServerCnxnFactory$AcceptThread@219] - accept thread exitted run method [junit] 2015-05-30 10:40:16,903 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Leader] [junit] 2015-05-30 10:40:16,904 [myid:] - WARN [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled):QuorumPeer@1039] - Unexpected exception [junit] java.lang.InterruptedException [junit] at java.lang.Object.wait(Native Method) [junit] at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:559) [junit] at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1036) [junit] 2015-05-30 10:40:16,904 [myid:] - INFO 
[localhost/127.0.0.1:11365:QuorumCnxManager$Listener@659] - Leaving listener [junit] 2015-05-30 10:40:16,904 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled):Leader@613] - Shutting down [junit] 2015-05-30 10:40:16,904 [myid:] - INFO [main:QuorumUtil@254] - Shutting down leader election QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled) [junit] 2015-05-30 10:40:16,904 [myid:] - WARN [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled):QuorumPeer@1070] - PeerState set to LOOKING [junit] 2015-05-30 10:40:16,904 [myid:] - WARN [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled):QuorumPeer@1052] - QuorumPeer main thread exited [junit] 2015-05-30 10:40:16,904 [myid:] - INFO [main:QuorumUtil@259] - Waiting for QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled) to exit thread [junit] 2015-05-30 10:40:16,905 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3] [junit] 2015-05-30 10:40:16,905 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3] [junit] 2015-05-30 10:40:16,905 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.1] [junit] 2015-05-30 10:40:16,905 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:11358)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.2] [junit] 2015-05-30 10:40:16,905 [myid:] - INFO [main:FourLetterWordMain@63] - connecting to 127.0.0.1 11352 [junit] 2015-05-30 10:40:16,906 [myid:] - INFO [main:QuorumUtil@243] - 127.0.0.1:11352 is no longer accepting client connections [junit] 
2015-05-30 10:40:16,906 [myid:] - INFO [main:FourLetterWordMain@63] - connecting to 127.0.0.1 11355 [junit] 2015-05-30 10:40:16,906 [myid:] - INFO [main:QuorumUtil@243] - 127.0.0.1:11355 is no longer accepting client connections [junit] 2015-05-30 10:40:16,906 [myid:] - INFO [main:FourLetterWordMain@63] - connecting to 127.0.0.1 11358 [junit] 2015-05-30 10:40:16,906 [myid:] - INFO [main:QuorumUtil@243] - 127.0.0.1:11358 is no longer accepting client connections [junit] 2015-05-30 10:40:16,908 [myid:] - INFO [main:ZKTestCase$1@65] - SUCCEEDED testPortChange [junit] 2015-05-30 10:40:16,908 [myid:] - INFO [main:ZKTestCase$1@60] - FINISHED
[jira] [Commented] (ZOOKEEPER-2189) multiple leaders can be elected when configs conflict
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565932#comment-14565932 ] Hudson commented on ZOOKEEPER-2189: --- FAILURE: Integrated in ZooKeeper-trunk #2707 (See [https://builds.apache.org/job/ZooKeeper-trunk/2707/]) ZOOKEEPER-2189: QuorumCnxManager: use BufferedOutputStream for initial msg (Raul Gutierrez Segales via hdeng) (hdeng: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1682558) * /zookeeper/trunk/CHANGES.txt * /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java multiple leaders can be elected when configs conflict - Key: ZOOKEEPER-2189 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2189 Project: ZooKeeper Issue Type: Bug Components: leaderElection Affects Versions: 3.5.0 Reporter: Akihiro Suda This sequence leads the ensemble to a split-brain state: * Start server 1 (config=1:participant, 2:participant, 3:participant) * Start server 2 (config=1:participant, 2:participant, 3:participant) * 1 and 2 believe 2 is the leader * Start server 3 (config=1:observer, 2:observer, 3:participant) * 3 believes 3 is the leader, although 1 and 2 still believe 2 is the leader Such a split-brain ensemble is very unstable. Znodes can be lost easily: * Create some znodes on 2 * Restart 1 and 2 * 1, 2 and 3 can think 3 is the leader * znodes created on 2 are lost, as 1 and 2 sync with 3 I consider this behavior a bug: ZK should fail gracefully if a participant is listed as an observer in the config. In the current implementation, ZK cannot detect such an invalid config, as FastLeaderElection.sendNotification() sends notifications only to voting members, and hence there is no message from the observers (1 and 2) to the new voter (3). I think FastLeaderElection.sendNotification() should send notifications to all the members, and FastLeaderElection.Messenger.WorkerReceiver.run() should verify acks. Any thoughts? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
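The config-conflict detection proposed in ZOOKEEPER-2189 amounts to comparing two servers' views of each member's role. Below is a minimal, hypothetical sketch of that comparison; the RoleConflict class is invented for illustration and is not ZooKeeper's verification code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the cross-config sanity check proposed above: given
// two servers' views of the ensemble, list every server id whose declared
// role differs between them. A non-empty result means the configs conflict
// and the joiner should fail fast instead of electing itself.
public class RoleConflict {
    // Each config maps server id -> declared role ("participant" or "observer").
    public static List<Integer> conflictingIds(Map<Integer, String> a,
                                               Map<Integer, String> b) {
        List<Integer> conflicts = new ArrayList<>();
        for (Map.Entry<Integer, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                conflicts.add(e.getKey());
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        Map<Integer, String> s1 = new TreeMap<>(); // config seen by servers 1 and 2
        s1.put(1, "participant"); s1.put(2, "participant"); s1.put(3, "participant");
        Map<Integer, String> s3 = new TreeMap<>(); // conflicting config seen by server 3
        s3.put(1, "observer"); s3.put(2, "observer"); s3.put(3, "participant");
        System.out.println("conflicting ids: " + conflictingIds(s1, s3));
    }
}
```

For the scenario in the report, the two configs disagree on the roles of servers 1 and 2, which is exactly the mismatch that lets server 3 elect itself.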
[jira] [Commented] (ZOOKEEPER-2179) Typo in Watcher.java
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565933#comment-14565933 ] Hudson commented on ZOOKEEPER-2179: --- FAILURE: Integrated in ZooKeeper-trunk #2707 (See [https://builds.apache.org/job/ZooKeeper-trunk/2707/]) ZOOKEEPER-2179: Typo in Watcher.java (Archana T via rgs) (rgs: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1682539) * /zookeeper/trunk/CHANGES.txt * /zookeeper/trunk/src/java/main/org/apache/zookeeper/Watcher.java Typo in Watcher.java Key: ZOOKEEPER-2179 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2179 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.5, 3.5.0 Reporter: Eunchan Kim Priority: Trivial Fix For: 3.4.7, 3.5.0, 3.6.0 Attachments: ZOOKEEPER-2179.patch at zookeeper/src/java/main/org/apache/zookeeper/Watcher.java, * implement. A ZooKeeper client will get various events from the ZooKeepr should be fixed to * implement. A ZooKeeper client will get various events from the ZooKeeper. (Zookeepr - Zookeeper) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2187) remove duplicated code between CreateRequest{,2}
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565931#comment-14565931 ] Hudson commented on ZOOKEEPER-2187: --- FAILURE: Integrated in ZooKeeper-trunk #2707 (See [https://builds.apache.org/job/ZooKeeper-trunk/2707/]) ZOOKEEPER-2187: remove duplicated code between CreateRequest{,2} (Raul Gutierrez Segales via hdeng) (hdeng: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1682521) * /zookeeper/trunk/CHANGES.txt * /zookeeper/trunk/src/c/src/zookeeper.c * /zookeeper/trunk/src/java/main/org/apache/zookeeper/MultiTransactionRecord.java * /zookeeper/trunk/src/java/main/org/apache/zookeeper/ZooKeeper.java * /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java * /zookeeper/trunk/src/zookeeper.jute remove duplicated code between CreateRequest{,2} Key: ZOOKEEPER-2187 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2187 Project: ZooKeeper Issue Type: Bug Components: c client, java client, server Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales Priority: Minor Fix For: 3.5.1, 3.6.0 Attachments: ZOOKEEPER-2187.patch To avoid cargo-culting and reduce duplicated code, we can merge most of CreateRequest and CreateRequest2, given that only the Response object is actually different. This will improve readability of the code and make it less confusing for people adding new opcodes in the future (i.e., copying a request definition vs. reusing what's already there, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
ZooKeeper_branch35_jdk7 - Build # 310 - Failure
See https://builds.apache.org/job/ZooKeeper_branch35_jdk7/310/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 364664 lines...] [junit] 2015-05-30 10:10:04,165 [myid:] - WARN [LearnerHandler-/127.0.0.1:2:LearnerHandler@595] - *** GOODBYE /127.0.0.1:2 [junit] 2015-05-30 10:10:04,165 [myid:] - WARN [LearnerHandler-/127.0.0.1:3:LearnerHandler@595] - *** GOODBYE /127.0.0.1:3 [junit] 2015-05-30 10:10:04,166 [myid:] - WARN [LearnerHandler-/127.0.0.1:3:LearnerHandler@879] - Ignoring unexpected exception [junit] java.lang.InterruptedException [junit] at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) [junit] at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) [junit] at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) [junit] at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:877) [junit] at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:598) [junit] 2015-05-30 10:10:04,166 [myid:] - INFO [ConnnectionExpirer:NIOServerCnxnFactory$ConnectionExpirerThread@583] - ConnnectionExpirerThread interrupted [junit] 2015-05-30 10:10:04,167 [myid:] - WARN [LearnerHandler-/127.0.0.1:2:LearnerHandler@879] - Ignoring unexpected exception [junit] java.lang.InterruptedException [junit] at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) [junit] at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) [junit] at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) [junit] at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:877) [junit] at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:598) [junit] 2015-05-30 10:10:04,168 [myid:] - INFO [NIOServerCxnFactory.SelectorThread-1:NIOServerCnxnFactory$SelectorThread@420] - selector thread exitted 
run method [junit] 2015-05-30 10:10:04,168 [myid:] - INFO [NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:19442:NIOServerCnxnFactory$AcceptThread@219] - accept thread exitted run method [junit] 2015-05-30 10:10:04,168 [myid:] - INFO [NIOServerCxnFactory.SelectorThread-0:NIOServerCnxnFactory$SelectorThread@420] - selector thread exitted run method [junit] 2015-05-30 10:10:04,168 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Leader] [junit] 2015-05-30 10:10:04,168 [myid:] - INFO [/127.0.0.1:19444:QuorumCnxManager$Listener@659] - Leaving listener [junit] 2015-05-30 10:10:04,168 [myid:] - WARN [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled):QuorumPeer@1039] - Unexpected exception [junit] java.lang.InterruptedException [junit] at java.lang.Object.wait(Native Method) [junit] at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:559) [junit] at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1036) [junit] 2015-05-30 10:10:04,169 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled):Leader@613] - Shutting down [junit] 2015-05-30 10:10:04,169 [myid:] - WARN [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled):QuorumPeer@1070] - PeerState set to LOOKING [junit] 2015-05-30 10:10:04,169 [myid:] - WARN [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled):QuorumPeer@1052] - QuorumPeer main thread exited [junit] 2015-05-30 10:10:04,169 [myid:] - INFO [main:QuorumUtil@254] - Shutting down leader election QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled) [junit] 2015-05-30 10:10:04,170 [myid:] - INFO [main:QuorumUtil@259] - Waiting for QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled) to exit thread [junit] 2015-05-30 10:10:04,169 [myid:] - INFO 
[QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3] [junit] 2015-05-30 10:10:04,170 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3] [junit] 2015-05-30 10:10:04,170 [myid:] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:19442)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.1]
Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1
I don't see a reason to -1 the release just because of the number of threads junit is using. I've been a bit distracted with other things, but I'm coming back to the release candidate now. -Flavio On 23 May 2015, at 22:09, Michi Mutsuzaki mutsuz...@gmail.com wrote: I can go either way. Flavio, do you think we should set the default test.junit.threads to 1 and create another release candidate? On Fri, May 22, 2015 at 5:08 PM, Chris Nauroth cnaur...@hortonworks.com wrote: I haven't been able to repro this locally. Here are the details on my Ubuntu VM: uname -a Linux ubuntu 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux java -version java version 1.8.0_45 Java(TM) SE Runtime Environment (build 1.8.0_45-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode) ant -version Apache Ant(TM) version 1.9.4 compiled on April 29 2014 I'm getting 100% passing test runs with multiple concurrent JUnit processes, including the tests that you mentioned were failing in your environment. I don't have any immediate ideas for what to try next. Everything has been working well on Jenkins and multiple dev machines, so it seems like there is some subtle environmental difference in this VM that I didn't handle in the ZOOKEEPER-2183 patch. Is this problematic for the release candidate? If so, then I recommend doing a quick change to set the default test.junit.threads to 1 in build.xml. That would restore the old single-process testing behavior. We can change test-patch.sh to pass -Dtest.junit.threads=8 on the command line, so we'll still get speedy pre-commit runs on Jenkins where it is working well. We all can do the same when we run ant locally too. Let me know if this is important, and I can put together a patch quickly. Thanks! 
--Chris Nauroth From: Flavio Junqueira fpjunque...@yahoo.com Date: Friday, May 22, 2015 at 3:37 PM To: Chris Nauroth cnaur...@hortonworks.com Cc: Zookeeper dev@zookeeper.apache.org Subject: Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1 That's the range I get in the vm. I also checked the load from log test and the port it was trying to bind to is 11222. -Flavio On 22 May 2015, at 23:14, Chris Nauroth cnaur...@hortonworks.com wrote: No worries on the delay. Thank you for sharing. That's interesting. The symptoms look similar to something we had seen from an earlier iteration of the ZOOKEEPER-2183 patch that was assigning ports from the ephemeral port range. This would cause a brief (but noticeable) window in which the OS could assign the same ephemeral port to a client socket while a server test still held onto that port assignment. It was particularly noticeable for tests that stop and restart a server on the same port, such as tests covering client reconnect logic. In the final committed version of the ZOOKEEPER-2183 patch, I excluded the ephemeral port range from use by port assignment. Typically, that's 32768 - 61000 on Linux. Is it possible that this VM is configured to use a different ephemeral port range? Here is what I get from recent stock Ubuntu and CentOS installs: cat /proc/sys/net/ipv4/ip_local_port_range 32768 61000 --Chris Nauroth From: Flavio Junqueira fpjunque...@yahoo.com Date: Friday, May 22, 2015 at 2:47 PM To: Chris Nauroth cnaur...@hortonworks.com Cc: Zookeeper dev@zookeeper.apache.org Subject: Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1 Sorry about the delay, here are the logs: http://people.apache.org/~fpj/logs-3.5.1-rc1/ the load test is giving bind exceptions. 
-Flavio On 21 May 2015, at 23:02, Chris Nauroth cnaur...@hortonworks.commailto:cnaur...@hortonworks.com wrote: Thanks, sharing logs would be great. I'll try to repro independently with JDK8 too. --Chris Nauroth On 5/21/15, 2:30 PM, Flavio Junqueira fpjunque...@yahoo.com.INVALIDmailto:fpjunque...@yahoo.com.INVALID wrote: I accidentally removed dev from the response, bringing it back in. The tests are failing intermittently for me. In the last run, I got these failing: [junit] Tests run: 8, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 30.444 sec [junit] Test org.apache.zookeeper.test.LoadFromLogTest FAILED [junit] Tests run: 86, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 264.272 sec [junit] Test org.apache.zookeeper.test.NioNettySuiteTest FAILED Still the same setup, linux + jdk 8. I can share logs if necessary. -Flavio On Thursday, May 21, 2015 8:28 PM, Chris Nauroth cnaur...@hortonworks.commailto:cnaur...@hortonworks.com wrote: Ah, my mistake. I saw Azure and my brain jumped right to Windows. I suppose the
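As background for the port discussion in this thread: the ZOOKEEPER-2183 change avoids handing test servers ports that the OS may also assign to outgoing client sockets. A minimal sketch of that kind of exclusion follows; the class and method names are invented for illustration, and 32768-61000 is only the common Linux default range, which is exactly the environmental assumption being questioned here.

```java
public class EphemeralPortCheck {
    // Common Linux defaults; verify locally with:
    //   cat /proc/sys/net/ipv4/ip_local_port_range
    static final int EPHEMERAL_LO = 32768;
    static final int EPHEMERAL_HI = 61000;

    // A test-server port is "safe" when the OS cannot hand the same port to an
    // outgoing client socket, avoiding the bind/reconnect races described above.
    static boolean isSafeTestPort(int port) {
        return port < EPHEMERAL_LO || port > EPHEMERAL_HI;
    }

    public static void main(String[] args) {
        // 11222 is the port the failing load test tried to bind; it lies
        // outside the default ephemeral range.
        System.out.println(isSafeTestPort(11222)); // prints true
        System.out.println(isSafeTestPort(40000)); // prints false
    }
}
```

On a VM configured with a non-default ip_local_port_range, constants like these would no longer match the real range, which is one way a subtle environment difference could produce the intermittent bind failures discussed above.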
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566031#comment-14566031 ] Flavio Junqueira commented on ZOOKEEPER-2172: - There are a few really weird things here. Check these notifications:
{noformat}
Notification: 2 (message format version), -9223372036854775808 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LEADING (my state)10049 (n.config version)
{noformat}
I checked the logs of 3 and it does look like it sent this notification.
{noformat}
Sending Notification: -9223372036854775808 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 1 (recipient), 3 (myid), 0x0 (n.peerEpoch)
{noformat}
The initialization of leader election here doesn't look right. And, as [~shralex] has pointed out, 2 and 3 apparently received notifications with 0x as the round of the sender.
{noformat}
Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x (n.round), LEADING (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)10049 (n.config version)
{noformat}
I found no evidence in the log of 1 that it has actually set or sent such a value. The values I'm seeing in the notifications across logs look a bit strange. Cluster crashes when reconfig a new node as a participant - Key: ZOOKEEPER-2172 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172 Project: ZooKeeper Issue Type: Bug Components: leaderElection, quorum, server Affects Versions: 3.5.0 Environment: Ubuntu 12.04 + java 7 Reporter: Ziyou Wang Priority: Critical Attachments: node-1.log, node-2.log, node-3.log, zoo.cfg.dynamic.1005d, zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-2.log, zookeeper-3.log The operations are quite simple: start three zk servers one by one, then reconfig the cluster to add the new one as a participant. When I add the third one, the zk cluster may enter a weird state and cannot recover.
I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log. So the first node received the reconfig cmd at 12:53:48. Later, it logged “2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - *** GOODBYE /10.0.0.2:55890 ”. From then on, the first node and second node rejected all client connections and the third node didn’t join the cluster as a participant. The whole cluster was down. When the problem happened, all three nodes just used the same dynamic config file zoo.cfg.dynamic.1005d, which only contained the first two nodes. But there was another unused dynamic config file in the node-1 directory, zoo.cfg.dynamic.next, which already contained three nodes. When I extended the waiting time between starting the third node and reconfiguring the cluster, the problem didn’t show again. So it should be a race condition problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
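A side note on the notifications quoted in the comment above: the suspicious n.leader value -9223372036854775808 is exactly Long.MIN_VALUE, the usual Java sentinel for a long field that was never set, which is consistent with the suspicion that the sender's leader-election state was not initialized properly. This is a reading of the logs, not a confirmed diagnosis:

```java
public class SentinelCheck {
    public static void main(String[] args) {
        // The odd n.leader value from the notification above:
        long nLeader = -9223372036854775808L;
        // It is precisely Long.MIN_VALUE, i.e. an uninitialized-vote sentinel.
        System.out.println(nLeader == Long.MIN_VALUE); // prints true
        System.out.println(Long.MIN_VALUE);            // prints -9223372036854775808
    }
}
```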
[jira] [Commented] (ZOOKEEPER-2098) QuorumCnxManager: use BufferedOutputStream for initial msg
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566038#comment-14566038 ] Raul Gutierrez Segales commented on ZOOKEEPER-2098: --- [~hdeng]: typo in the commit message, it's ZOOKEEPER-2098 not ZOOKEEPER-2198 QuorumCnxManager: use BufferedOutputStream for initial msg -- Key: ZOOKEEPER-2098 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2098 Project: ZooKeeper Issue Type: Improvement Components: quorum, server Affects Versions: 3.5.0 Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales Fix For: 3.5.1, 3.6.0 Attachments: ZOOKEEPER-2098.patch, ZOOKEEPER-2098.patch Whilst writing fle-dump (a tool like [zk-dump|https://github.com/twitter/zktraffic/], but to dump FastLeaderElection messages), I noticed that QCM is using DataOutputStream (which doesn't buffer) directly. So all calls to write() are written immediately to the network, which means simple messages like two participants exchanging Votes can take a couple of RTTs! This is especially terrible for global clusters (i.e.: x-country RTTs). The solution is to use BufferedOutputStream for the initial negotiation between members of the cluster. Note that there are other places where suboptimal (but not entirely unbuffered) writes to the network still exist. I'll get those in separate tickets. After using BufferedOutputStream we get only 1 RTT for the initial message, so election time for participants to join a cluster is reduced. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
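The buffering effect described in the ticket can be illustrated with a small counting stream. The message fields below are invented and are not the real QuorumCnxManager wire format; the point is only that wrapping the DataOutputStream in a BufferedOutputStream collapses the several per-field writes into a single underlying write at flush time, which on a real socket means one packet instead of several.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class BufferedVsUnbuffered {

    // Stands in for the socket: counts how many times it is written to,
    // a rough proxy for how many packets the kernel could emit.
    static class CountingOutputStream extends OutputStream {
        int writeCalls = 0;
        @Override public void write(int b) { writeCalls++; }
        @Override public void write(byte[] b, int off, int len) { writeCalls++; }
    }

    // Writes an invented "initial message" (a protocol word, a server id, and
    // a length-prefixed address) and reports how often the sink was hit.
    static int countWrites(boolean buffered) {
        CountingOutputStream sink = new CountingOutputStream();
        DataOutputStream dout = new DataOutputStream(
                buffered ? new BufferedOutputStream(sink) : sink);
        try {
            dout.writeLong(-65536L);                 // protocol word
            dout.writeLong(3L);                      // server id
            byte[] addr = "10.0.0.3:3888".getBytes(StandardCharsets.UTF_8);
            dout.writeInt(addr.length);
            dout.write(addr);
            dout.flush(); // buffered case: everything leaves in a single write
        } catch (IOException e) {
            throw new AssertionError(e); // cannot happen for an in-memory sink
        }
        return sink.writeCalls;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered underlying writes: " + countWrites(false));
        System.out.println("buffered underlying writes: " + countWrites(true));
    }
}
```

The unbuffered variant hits the sink once per field write (several times in total), while the buffered variant hits it exactly once on flush.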
[jira] [Commented] (ZOOKEEPER-2189) multiple leaders can be elected when configs conflict
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566040#comment-14566040 ] Raul Gutierrez Segales commented on ZOOKEEPER-2189: --- this message is bogus, the commit message meant to reference ZOOKEEPER-2098 multiple leaders can be elected when configs conflict - Key: ZOOKEEPER-2189 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2189 Project: ZooKeeper Issue Type: Bug Components: leaderElection Affects Versions: 3.5.0 Reporter: Akihiro Suda This sequence leads the ensemble to a split-brain state: * Start server 1 (config=1:participant, 2:participant, 3:participant) * Start server 2 (config=1:participant, 2:participant, 3:participant) * 1 and 2 believe 2 is the leader * Start server 3 (config=1:observer, 2:observer, 3:participant) * 3 believes 3 is the leader, although 1 and 2 still believe 2 is the leader Such a split-brain ensemble is very unstable. Znodes can be lost easily: * Create some znodes on 2 * Restart 1 and 2 * 1, 2 and 3 can think 3 is the leader * znodes created on 2 are lost, as 1 and 2 sync with 3 I consider this behavior a bug: ZK should fail gracefully if a participant is listed as an observer in the config. In the current implementation, ZK cannot detect such an invalid config, as FastLeaderElection.sendNotification() sends notifications to only voting members and hence there is no message from the observers (1 and 2) to the new voter (3). I think FastLeaderElection.sendNotification() should send notifications to all the members and FastLeaderElection.Messenger.WorkerReceiver.run() should verify acks. Any thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
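The failure mode in this ticket comes from servers disagreeing about which members vote. A toy illustration of the kind of cross-check the reporter is proposing follows; the types and method are invented for this sketch, and a real detection would have to piggyback on the election notification messages themselves rather than compare config maps directly.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ConfigSanityCheck {
    enum Role { PARTICIPANT, OBSERVER }

    // Returns the server ids whose role differs between two views of the
    // ensemble config; a non-empty result means the ensemble is at risk of
    // electing conflicting leaders.
    static Set<Long> conflictingRoles(Map<Long, Role> a, Map<Long, Role> b) {
        Set<Long> conflicts = new TreeSet<>();
        for (Map.Entry<Long, Role> e : a.entrySet()) {
            Role other = b.get(e.getKey());
            if (other != null && other != e.getValue()) {
                conflicts.add(e.getKey());
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        // The config as seen by servers 1 and 2 in the ticket's scenario:
        Map<Long, Role> view12 = new HashMap<>();
        view12.put(1L, Role.PARTICIPANT);
        view12.put(2L, Role.PARTICIPANT);
        view12.put(3L, Role.PARTICIPANT);
        // Server 3's conflicting config, which demotes 1 and 2 to observers:
        Map<Long, Role> view3 = new HashMap<>();
        view3.put(1L, Role.OBSERVER);
        view3.put(2L, Role.OBSERVER);
        view3.put(3L, Role.PARTICIPANT);
        System.out.println(conflictingRoles(view12, view3)); // prints [1, 2]
    }
}
```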
[jira] [Commented] (ZOOKEEPER-2098) QuorumCnxManager: use BufferedOutputStream for initial msg
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566039#comment-14566039 ] Raul Gutierrez Segales commented on ZOOKEEPER-2098: --- err, meant not ZOOKEEPER-2189 QuorumCnxManager: use BufferedOutputStream for initial msg -- Key: ZOOKEEPER-2098 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2098 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [VOTE] Apache ZooKeeper release 3.5.1-alpha candidate 1
Another thing that is possibly not a reason to drop the candidate, but I'm getting this with this RC:

[javac] /home/fpj/code/zookeeper-3.5.1-alpha/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java:134: error: unmappable character for encoding ASCII
[javac] // was not being queued ??? ZOOKEEPER-558) properly. This happens, for example,

It is a trivial problem to solve, but it does generate a compilation error for me. -Flavio On 30 May 2015, at 15:26, Flavio Junqueira fpjunque...@yahoo.com.INVALID wrote: I don't see a reason to -1 the release just because of the number of threads junit is using. I've been a bit distracted with other things, but I'm coming back to the release candidate now. -Flavio
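The unmappable-character error reported above happens because javac decodes source files with the platform default encoding, which is ASCII in some locales. One common fix, sketched here for an Ant build (the encoding attribute is standard on Ant's javac task, but the srcdir/destdir values are illustrative, not ZooKeeper's actual build.xml), is to pin the source encoding explicitly:

```xml
<!-- Illustrative Ant javac task: the encoding attribute tells javac how to
     decode source files, so a UTF-8 comment no longer breaks a build running
     in an ASCII locale. -->
<javac srcdir="src/java/main" destdir="build/classes" encoding="UTF-8"/>
```

The equivalent on the command line is passing -encoding UTF-8 to javac; the alternative fix, which the attached ZOOKEEPER-2197 patch appears to take, is to keep the source tree pure ASCII.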