[jira] [Commented] (ZOOKEEPER-2162) infinite exception loop occurs when dataDir is lost
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095449#comment-15095449 ]

Akihiro Suda commented on ZOOKEEPER-2162:
-

Ping :D

> infinite exception loop occurs when dataDir is lost
> ---
>
> Key: ZOOKEEPER-2162
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2162
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.0
> Reporter: Akihiro Suda
> Attachments: ZOOKEEPER-2162-v2-repro-only.log, ZOOKEEPER-2162-v2-repro-only.patch, ZOOKEEPER-2162-v3.patch, ZOOKEEPER-2162-v4.patch, ZOOKEEPER-2162.patch
>
>
> This sequence leads server.1 and server.2 into an infinite exception loop.
> * Start server.1 and server.2 with the initial ensemble server.1=participant, server.2=observer.
>   At this point, acceptedEpoch\[i\] == currentEpoch\[i\] == 1 for i = 1, 2.
> * Invoke reconfig so that acceptedEpoch\[i\] and currentEpoch\[i\] grow to 2.
> * Kill server.2.
> * Remove the dataDir of server.2, excluding the myid file.
>   (In real production environments, both confDir and dataDir can be lost due to reprovisioning.)
> * Start server.2.
> * server.1 and server.2 enter an infinite exception loop.
>   The log size (threshold is set to INFO in log4j.properties) can exceed 100MB in 30 seconds.
> AFAIK, the bug can be reproduced with ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).
> I made a Docker container so that people who are interested can reproduce the bug easily.
> (Sorry, there is no JUnit test right now.)
> {noformat}
> $ docker run -i -t --rm akihirosuda/zookeeper-bug01
> Reproducing the bug: infinite exception loop occurs when dataDir is lost
> * Resetting
> * Starting [1,2] with the initial ensemble [1]
> * Sleeping for 3 seconds
> * Invoking Reconfig [1]->[2]
> * Sleeping for 3 seconds
> * Killing server.2 (pid=10542)
> * Sleeping for 3 seconds
> * Resetting /zk02_data
> * Starting server.2
> * Sleeping for 30 seconds
> /zk01_log: 81665114 bytes
> The log dir is extremely large. Perhaps the bug was REPRODUCED!
> /zk02_log: 23949367 bytes
> The log dir is extremely large. Perhaps the bug was REPRODUCED!
> * Exiting
> {noformat}
> h2. Log
> h3. server.1
> {noformat}
> 2015-04-13 03:48:17,624 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1022] - FOLLOWING
> 2015-04-13 03:48:17,624 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ZooKeeperServer@825] - minSessionTimeout set to 4000
> 2015-04-13 03:48:17,624 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ZooKeeperServer@834] - maxSessionTimeout set to 4
> 2015-04-13 03:48:17,624 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ZooKeeperServer@156] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir /zk01_data/version-2 snapdir /zk01_data/version-2
> 2015-04-13 03:48:17,624 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@66] - FOLLOWING - LEADER ELECTION TOOK - 0
> 2015-04-13 03:48:17,625 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@93] - Exception when following the leader
> java.io.IOException: Leaders epoch, 1 is less than accepted epoch, 2
>         at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:331)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:75)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1024)
> 2015-04-13 03:48:17,626 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService: name0=ReplicatedServer_id1,name1=replica.1,name2=Follower]
> 2015-04-13 03:48:17,626 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@198] - shutdown called
> java.lang.Exception: shutdown Follower
>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:198)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1028)
> 2015-04-13 03:48:17,626 [myid:1] - DEBUG [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):LearnerZooKeeperServer@162] - ZooKeeper server is not running, so not proceeding to shutdown!
> 2015-04-13 03:48:17,626 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1071] - PeerState set to LOOKING
> 2015-04-13 03:48:17,626 [myid:1] - INFO [Q
[jira] [Commented] (ZOOKEEPER-2347) Deadlock shutting down zookeeper
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095073#comment-15095073 ]

Ted Yu commented on ZOOKEEPER-2347:
---

Assuming there has been only a test change since I performed validation last year, this should be good to go.

> Deadlock shutting down zookeeper
> 
>
> Key: ZOOKEEPER-2347
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2347
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.7
> Reporter: Ted Yu
> Assignee: Rakesh R
> Priority: Blocker
> Fix For: 3.4.8
>
> Attachments: ZOOKEEPER-2347-br-3.4.patch, ZOOKEEPER-2347-br-3.4.patch, ZOOKEEPER-2347-br-3.4.patch, ZOOKEEPER-2347-br-3.4.patch, testSplitLogManager.stack
>
>
> HBase recently upgraded to zookeeper 3.4.7.
> In one of the tests, TestSplitLogManager, there is a reproducible hang at the end of the test.
> Below is a snippet from the stack trace related to zookeeper:
> {code}
> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f waiting on condition [0x00011834b000]
>    java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for <0x0007c5b8d3a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main-SendThread(localhost:59510)" daemon prio=5 tid=0x7fd274eb4000 nid=0x9513 waiting on condition [0x000118042000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
>   at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for monitor entry [0x0001170ac000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>   at org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
>   - waiting to lock <0x0007c5b62128> (a org.apache.zookeeper.server.ZooKeeperServer)
>   at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
>   at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
>   at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b waiting on condition [0x000117a3]
>    java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for <0x0007c9b106b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main" prio=5 tid=0x7fd27600 nid=0x1903 in Object.wait() [0x000108aa1000]
>    java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x0007c5b66400> (a org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x0007c5b66400> (a org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1355)
>   at org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:213)
>   at org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:770)
>   at org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:478)
>   - locked <0x0007c5b62128> (a org.apache.zookeeper.server.ZooKeeperServer)
>   at org.apache.zookeeper.server.NIOServerCnxnFactory.shutdown(NIOServerCnxnFactory.java:266)
>   at org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.shutdown(MiniZooKeeperCluster.java:301)
> {code}
> Note the address (0x0007c5b66400) in the last hunk, which seems to indicate some form of deadlock.
> According to Camille Fournier:
> We made shutdown synchronized. But decrementing the requests is also synchronized and called from a different thread. So yeah, deadlock.
> This came in
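
The lock pattern Camille Fournier describes can be reproduced in isolation. The following is only a sketch under stated assumptions: the class `ShutdownDeadlockDemo` and its methods are hypothetical stand-ins, not ZooKeeper code, and the join uses a timeout so the demo terminates instead of hanging the way the stack trace above does.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the server object; not ZooKeeper code.
class ShutdownDeadlockDemo {
    private final CountDownLatch shutdownHoldsLock = new CountDownLatch(1);
    private int inProcess = 1;

    // Needs the same monitor as shutdownAndJoin(), like decInProcess() in the trace.
    synchronized void decInProcess() {
        inProcess--;
    }

    // Holds the monitor while waiting for the worker, like the synchronized shutdown().
    // The code in the trace joins without a timeout; the timeout here only keeps the demo finite.
    synchronized boolean shutdownAndJoin(Thread worker, long timeoutMs) throws InterruptedException {
        shutdownHoldsLock.countDown();
        worker.join(timeoutMs);
        return worker.isAlive(); // true: the worker could not finish while we hold the monitor
    }

    // Returns true if the worker was stuck during shutdown and finished once the monitor was released.
    static boolean reproduces() {
        ShutdownDeadlockDemo server = new ShutdownDeadlockDemo();
        Thread sync = new Thread(() -> {
            try {
                server.shutdownHoldsLock.await();
            } catch (InterruptedException e) {
                return;
            }
            server.decInProcess(); // blocks until shutdownAndJoin() returns
        }, "SyncThread-demo");
        sync.setDaemon(true);
        sync.start();
        try {
            boolean stuck = server.shutdownAndJoin(sync, 200);
            sync.join(2000); // the monitor is free now, so the worker can complete
            return stuck && !sync.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("worker blocked during shutdown: " + reproduces());
    }
}
```

With an untimed join, as in the trace, neither thread can make progress: the joining thread keeps the monitor and the worker waits for it.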
[jira] [Commented] (ZOOKEEPER-1962) Add a CLI command to recursively list a znode and children
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094419#comment-15094419 ]

Chris Nauroth commented on ZOOKEEPER-1962:
--

+1 for Enis's proposal. That logic agrees with GNU coreutils {{ls -R}} too, so it will be familiar to end users. It still preserves the current behavior too, because you can think of the current behavior as the degenerate case of stopping before recursing. I don't think there are any backwards-compatibility concerns with that approach.

> Add a CLI command to recursively list a znode and children
> --
>
> Key: ZOOKEEPER-1962
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1962
> Project: ZooKeeper
> Issue Type: New Feature
> Components: java client
> Affects Versions: 3.4.6
> Reporter: Gautam Gopalakrishnan
> Assignee: Gautam Gopalakrishnan
> Priority: Minor
> Fix For: 3.5.2, 3.6.0
>
> Attachments: ZOOKEEPER-1962.diff, ZOOKEEPER-1962_v2.patch, ZOOKEEPER-1962_v3.patch, ZOOKEEPER-1962_v4.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> When troubleshooting applications where znodes can be multiple levels deep (eg. HBase replication), it is handy to see all child znodes recursively rather than run an ls for each node manually.
> So I propose adding an option to the "ls" command (-r) which will list all child nodes under a given znode.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
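
For readers skimming the thread, the recursive listing being proposed can be sketched roughly as below. This is an illustration only, not any of the attached patches: `RecursiveLs` and `listRecursive` are hypothetical names, and the child lookup is passed in as a function so the sketch runs without a server. A real CLI command would presumably back it with a client call such as `ZooKeeper#getChildren`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper; the name and shape are assumptions made for illustration.
class RecursiveLs {
    // Returns the given znode path followed by all of its descendants, depth first.
    static List<String> listRecursive(String path, Function<String, List<String>> getChildren) {
        List<String> out = new ArrayList<>();
        walk(path, getChildren, out);
        return out;
    }

    private static void walk(String path, Function<String, List<String>> getChildren, List<String> out) {
        out.add(path);
        // Copy before sorting so the caller's list is never modified.
        List<String> children = new ArrayList<>(getChildren.apply(path));
        Collections.sort(children);
        for (String child : children) {
            // The root is "/", so avoid producing "//child".
            String childPath = path.equals("/") ? "/" + child : path + "/" + child;
            walk(childPath, getChildren, out);
        }
    }
}
```

Listing a single node without descending is then the degenerate case mentioned in the comment: the walk stops before recursing into the children.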
[jira] [Commented] (ZOOKEEPER-2354) ZOOKEEPER-1653 not merged in master and 3.5 branch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094104#comment-15094104 ]

Hadoop QA commented on ZOOKEEPER-2354:
--

+1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12781838/ZOOKEEPER-2354-01.patch
  against trunk revision 1720227.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 6 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//console

This message is automatically generated.

> ZOOKEEPER-1653 not merged in master and 3.5 branch
> --
>
> Key: ZOOKEEPER-2354
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2354
> Project: ZooKeeper
> Issue Type: Bug
> Reporter: Arshad Mohammad
> Assignee: Arshad Mohammad
> Fix For: 3.5.2
>
> Attachments: ZOOKEEPER-2354-01.patch
>
>
> ZOOKEEPER-1653 is merged only to 3.4 branch.
> It should be merged to 3.5 and master branch as well.
Success: ZOOKEEPER-2354 PreCommit Build #3006
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2354
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006/

###
## LAST 60 LINES OF THE CONSOLE
###
[...truncated 389490 lines...]
     [exec] http://issues.apache.org/jira/secure/attachment/12781838/ZOOKEEPER-2354-01.patch
     [exec] against trunk revision 1720227.
     [exec]
     [exec] +1 @author. The patch does not contain any @author tags.
     [exec]
     [exec] +1 tests included. The patch appears to include 6 new or modified tests.
     [exec]
     [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
     [exec]
     [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
     [exec]
     [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
     [exec]
     [exec] +1 core tests. The patch passed core unit tests.
     [exec]
     [exec] +1 contrib tests. The patch passed contrib unit tests.
     [exec]
     [exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//testReport/
     [exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
     [exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//console
     [exec]
     [exec] This message is automatically generated.
     [exec]
     [exec]
     [exec] ==
     [exec] ==
     [exec] Adding comment to Jira.
     [exec] ==
     [exec] ==
     [exec]
     [exec]
     [exec] Comment added.
     [exec] 180cc089fbcf83048a2d7efa11e4e048353cd1d9 logged out
     [exec]
     [exec]
     [exec] ==
     [exec] ==
     [exec] Finished build.
     [exec] ==
     [exec] ==
     [exec]
     [exec]

BUILD SUCCESSFUL
Total time: 17 minutes 23 seconds
Archiving artifacts
Setting LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Recording test results
Setting LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Setting LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
[description-setter] Description set: ZOOKEEPER-2354
Email was triggered for: Success
Sending email for trigger: Success
Setting LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Setting LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Setting LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Setting LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7

###
## FAILED TESTS (if any)
##
All tests passed
[jira] [Commented] (ZOOKEEPER-1653) zookeeper fails to start because of inconsistent epoch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094048#comment-15094048 ]

Arshad Mohammad commented on ZOOKEEPER-1653:


Created a new jira, ZOOKEEPER-2354, for merging this important fix to master and the 3.5 branch.

> zookeeper fails to start because of inconsistent epoch
> --
>
> Key: ZOOKEEPER-1653
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1653
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.4.5
> Reporter: Michi Mutsuzaki
> Assignee: Michi Mutsuzaki
> Priority: Blocker
> Fix For: 3.4.6
>
> Attachments: ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.patch, ZOOKEEPER-1653.patch
>
>
> It looks like QuorumPeer.loadDataBase() could fail if the server was restarted after zk.takeSnapshot() but before finishing self.setCurrentEpoch(newEpoch) in Learner.java.
> {code:java}
> case Leader.NEWLEADER: // it will be NEWLEADER in v1.0
>     zk.takeSnapshot();
>     self.setCurrentEpoch(newEpoch); // <<< got restarted here
>     snapshotTaken = true;
>     writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
>     break;
> {code}
> The server fails to start because currentEpoch is still 1 but the last processed zxid from the snapshot has been updated.
> {noformat}
> 2013-02-20 13:45:02,733 5543 [pool-1-thread-1] ERROR org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on disk
> java.io.IOException: The current epoch, 1, is older than the last zxid, 8589934592
>         at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:439)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:413)
> ...
> {noformat}
> {noformat}
> $ find datadir
> datadir
> datadir/version-2
> datadir/version-2/currentEpoch.tmp
> datadir/version-2/acceptedEpoch
> datadir/version-2/snapshot.0
> datadir/version-2/currentEpoch
> datadir/version-2/snapshot.2
> $ cat datadir/version-2/currentEpoch.tmp
> 2%
> $ cat datadir/version-2/acceptedEpoch
> 2%
> $ cat datadir/version-2/currentEpoch
> 1%
> {noformat}
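
The numbers in the error message can be checked by hand: a zxid carries the epoch in its upper 32 bits, so 8589934592 (2 shifted left by 32) belongs to epoch 2, while the currentEpoch file still says 1. A minimal sketch of that comparison follows; `EpochCheck` and its method names are hypothetical and are not the code in QuorumPeer.

```java
// Hypothetical helper that mirrors the consistency rule implied by the error message.
class EpochCheck {
    // The epoch is stored in the high 32 bits of a zxid.
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    // A data directory is consistent when the last zxid's epoch is not newer than currentEpoch.
    static boolean isConsistent(long currentEpoch, long lastZxid) {
        return epochOf(lastZxid) <= currentEpoch;
    }

    public static void main(String[] args) {
        long lastZxid = 8589934592L;                        // value from the log above
        System.out.println(epochOf(lastZxid));              // prints 2
        System.out.println(isConsistent(1L, lastZxid));     // prints false: the reported failure
        System.out.println(isConsistent(2L, lastZxid));     // prints true: state after currentEpoch.tmp is applied
    }
}
```

This matches the file listing: currentEpoch.tmp already holds 2, but the restart happened before it replaced currentEpoch.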
[jira] [Updated] (ZOOKEEPER-2354) ZOOKEEPER-1653 not merged in master and 3.5 branch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arshad Mohammad updated ZOOKEEPER-2354:
---
    Attachment: ZOOKEEPER-2354-01.patch

Rebased the latest patch submitted by [~michim] for ZOOKEEPER-1653

> ZOOKEEPER-1653 not merged in master and 3.5 branch
> --
>
> Key: ZOOKEEPER-2354
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2354
> Project: ZooKeeper
> Issue Type: Bug
> Reporter: Arshad Mohammad
> Assignee: Arshad Mohammad
> Fix For: 3.5.2
>
> Attachments: ZOOKEEPER-2354-01.patch
>
>
> ZOOKEEPER-1653 is merged only to 3.4 branch.
> It should be merged to 3.5 and master branch as well.
[jira] [Created] (ZOOKEEPER-2354) ZOOKEEPER-1653 not merged in master and 3.5 branch
Arshad Mohammad created ZOOKEEPER-2354:
--

Summary: ZOOKEEPER-1653 not merged in master and 3.5 branch
Key: ZOOKEEPER-2354
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2354
Project: ZooKeeper
Issue Type: Bug
Reporter: Arshad Mohammad
Assignee: Arshad Mohammad
Fix For: 3.5.2

ZOOKEEPER-1653 is merged only to 3.4 branch.
It should be merged to 3.5 and master branch as well.
[jira] [Commented] (ZOOKEEPER-2353) QuorumCnxManager protocol needs to be upgradable with-in a specific Version
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094016#comment-15094016 ]

Powell Molleti commented on ZOOKEEPER-2353:
---

What would be the suggested approach to expanding the header? Is bumping the protocol version the only option?

> QuorumCnxManager protocol needs to be upgradable with-in a specific Version
> ---
>
> Key: ZOOKEEPER-2353
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2353
> Project: ZooKeeper
> Issue Type: Improvement
> Affects Versions: 3.4.7, 3.5.1
> Reporter: Powell Molleti
>
> Currently 3.5.X sends its header as follows:
> {code:title=QuorumCnxManager.java|borderStyle=solid}
> dout.writeLong(PROTOCOL_VERSION);
> dout.writeLong(self.getId());
> String addr = self.getElectionAddress().getHostString() + ":" + self.getElectionAddress().getPort();
> byte[] addr_bytes = addr.getBytes();
> dout.writeInt(addr_bytes.length);
> dout.write(addr_bytes);
> dout.flush();
> {code}
> Since it writes the length of the host-and-port byte string, there is no simple way to append new fields to this header anymore. That is, the receiving side has to treat all bytes after the sid as the host and port, which is what it does here:
> [QuorumCnxManager.InitialMessage.parse(): http://bit.ly/1Q0znpW]
> {code:title=QuorumCnxManager.java|borderStyle=solid}
> sid = din.readLong();
> int remaining = din.readInt();
> if (remaining <= 0 || remaining > maxBuffer) {
>     throw new InitialMessageException(
>             "Unreasonable buffer length: %s", remaining);
> }
> byte[] b = new byte[remaining];
> int num_read = din.read(b);
> if (num_read != remaining) {
>     throw new InitialMessageException(
>             "Read only %s bytes out of %s sent by server %s",
>             num_read, remaining, sid);
> }
> // FIXME: IPv6 is not supported. Using something like Guava's HostAndPort
> //        parser would be good.
> String addr = new String(b);
> String[] host_port = addr.split(":");
> {code}
> This has been captured in the discussion here: ZOOKEEPER-2186.
> Although it is possible to work around this problem by various means, the request here is to design messages with a header such that there is no need to bump the version number or overload certain fields (i.e. to guess whether a value is the length of the host/port string or the length of a different message, as in the case above).
> This is the idea, as captured in ZOOKEEPER-2186:
> {code:java}
> dout.writeLong(PROTOCOL_VERSION);
> String addr = self.getElectionAddress().getHostString() + ":" + self.getElectionAddress().getPort();
> byte[] addr_bytes = addr.getBytes();
> // After version write the total length of msg sent by sender.
> dout.writeInt(Long.BYTES + addr_bytes.length);
> // Write sid afterwards
> dout.writeLong(self.getId());
> // Write length of host/port string
> dout.writeInt(addr_bytes.length);
> // Write host/port string
> dout.write(addr_bytes);
> {code}
> Since the total length of the message and the length of each variable field are both present, it is quite easy to provide backward compatibility when parsing the message.
> Older code will read the part of the message it knows and ignore the rest. Newer revisions that want to stay compatible will only append to the header and not change the meaning of current fields.
> I am guessing this was the original intent behind the introduction of the protocol version here: ZOOKEEPER-1633
> Since 3.4.x code does not parse this and 3.5.x is still in alpha, perhaps it is possible to consider this change now?
> I would also like to propose carefully considering protobufs for the next protocol version bump. This would prevent issues like this in the future.
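
To make the compatibility argument concrete, here is a rough sketch of how a reader could use the total length to skip fields it does not know. Everything in it is an assumption made for illustration: the class `FramedHeader`, the version value, the `MAX_MESSAGE` cap, and the decision to count the 4-byte address-length field in the total are not taken from any ZooKeeper patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical framing sketch; names and constants are illustrative only.
class FramedHeader {
    static final long PROTOCOL_VERSION = 1L;  // placeholder value
    static final int MAX_MESSAGE = 2048;      // hypothetical sanity cap
    static final int FIXED = Long.BYTES + Integer.BYTES; // sid + address length

    // Writes version, total length, sid, address length, address, then any newer trailing fields.
    static byte[] write(long sid, String addr, byte[] newerFields) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dout = new DataOutputStream(bos);
            byte[] addrBytes = addr.getBytes(StandardCharsets.UTF_8);
            dout.writeLong(PROTOCOL_VERSION);
            dout.writeInt(FIXED + addrBytes.length + newerFields.length);
            dout.writeLong(sid);
            dout.writeInt(addrBytes.length);
            dout.write(addrBytes);
            dout.write(newerFields);
            dout.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // An "old" reader: parses the fields it knows and skips whatever a newer sender appended.
    static String readSidAndAddr(byte[] wire) {
        try {
            DataInputStream din = new DataInputStream(new ByteArrayInputStream(wire));
            din.readLong(); // protocol version, not checked in this sketch
            int total = din.readInt();
            if (total < FIXED || total > MAX_MESSAGE) {
                throw new IOException("Unreasonable message length: " + total);
            }
            long sid = din.readLong();
            int addrLen = din.readInt();
            if (addrLen <= 0 || addrLen > total - FIXED) {
                throw new IOException("Unreasonable address length: " + addrLen);
            }
            byte[] addr = new byte[addrLen];
            din.readFully(addr);
            int unknown = total - FIXED - addrLen; // bytes added by newer revisions
            while (unknown > 0) {
                int skipped = din.skipBytes(unknown);
                if (skipped <= 0) {
                    throw new EOFException("Message shorter than declared");
                }
                unknown -= skipped;
            }
            return sid + "@" + new String(addr, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The old reader returns the same result whether or not the sender appended extra bytes, which is the property the proposal is after.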
[jira] [Commented] (ZOOKEEPER-2186) QuorumCnxManager#receiveConnection may crash with random input
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093703#comment-15093703 ]

Markus Aalto commented on ZOOKEEPER-2186:
-

We are using 3.4.6. The proposed patch looks as if it might work for us, although it would require changing OS keep-alive options.

> QuorumCnxManager#receiveConnection may crash with random input
> --
>
> Key: ZOOKEEPER-2186
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2186
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6, 3.5.0
> Reporter: Raul Gutierrez Segales
> Assignee: Raul Gutierrez Segales
> Fix For: 3.4.7, 3.5.1, 3.6.0
>
> Attachments: ZOOKEEPER-2186-v3.4.patch, ZOOKEEPER-2186.patch, ZOOKEEPER-2186.patch, ZOOKEEPER-2186.patch
>
>
> This will allocate an arbitrarily large byte buffer (and try to read it!):
> {code}
> public boolean receiveConnection(Socket sock) {
>     Long sid = null;
> ...
>     sid = din.readLong();
>     // next comes the #bytes in the remainder of the message
>     int num_remaining_bytes = din.readInt();
>     byte[] b = new byte[num_remaining_bytes];
>     // remove the remainder of the message from din
>     int num_read = din.read(b);
> {code}
> This will crash the QuorumCnxManager thread, so the cluster will keep going but future elections might fail to converge (ditto for leaving/joining members).
> Patch coming up in a bit.
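
As a rough illustration of the kind of guard the issue calls for, the sketch below validates the declared length before allocating and uses `DataInputStream.readFully` so that a short read fails loudly. It is not the attached patch: `BoundedInitialRead`, `MAX_BUFFER`, and the `accepts` helper are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of a bounded read; not the code from the ZOOKEEPER-2186 patch.
class BoundedInitialRead {
    static final int MAX_BUFFER = 2048; // hypothetical cap on the declared length

    // Reads the sid, then a length-prefixed remainder, rejecting unreasonable lengths.
    static byte[] readRemainder(DataInputStream din) throws IOException {
        long sid = din.readLong();
        int remaining = din.readInt();
        if (remaining <= 0 || remaining > MAX_BUFFER) {
            throw new IOException("Unreasonable buffer length " + remaining + " from sid " + sid);
        }
        byte[] b = new byte[remaining];
        din.readFully(b); // throws EOFException if fewer bytes arrive than declared
        return b;
    }

    // Builds a message with the given declared length and reports whether it is accepted.
    static boolean accepts(long sid, int declaredLength, byte[] payload) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dout = new DataOutputStream(bos);
            dout.writeLong(sid);
            dout.writeInt(declaredLength);
            dout.write(payload);
            dout.flush();
            readRemainder(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts(1L, 9, "host:3888".getBytes()));      // prints true
        System.out.println(accepts(1L, Integer.MAX_VALUE, new byte[0])); // prints false, nothing allocated
        System.out.println(accepts(1L, 9, new byte[3]));                 // prints false, short read
    }
}
```

With random input, the declared length is rejected before any allocation, so a bogus connection costs only an exception rather than the thread.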