[jira] [Commented] (ZOOKEEPER-2162) infinite exception loop occurs when dataDir is lost

2016-01-12 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095449#comment-15095449
 ] 

Akihiro Suda commented on ZOOKEEPER-2162:
-

Ping :D


> infinite exception loop occurs when dataDir is lost
> ---
>
> Key: ZOOKEEPER-2162
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2162
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0
>Reporter: Akihiro Suda
> Attachments: ZOOKEEPER-2162-v2-repro-only.log, 
> ZOOKEEPER-2162-v2-repro-only.patch, ZOOKEEPER-2162-v3.patch, 
> ZOOKEEPER-2162-v4.patch, ZOOKEEPER-2162.patch
>
>
> The following sequence sends server.1 and server.2 into an infinite exception loop.
>  * Start server.1 and server.2 with the initial ensemble 
> server.1=participant, server.2=observer.
>    At this point, acceptedEpoch\[i\] == currentEpoch\[i\] == 1 for i = 1, 2.
>  * Invoke reconfig so that acceptedEpoch\[i\] and currentEpoch\[i\] grow 
> to 2.
>  * Kill server.2.
>  * Remove the dataDir of server.2, excluding the myid file.
>    (In real production environments, both confDir and dataDir can be lost 
> due to reprovisioning.)
>  * Start server.2.
>  * server.1 and server.2 enter an infinite exception loop.
>    The log size (threshold set to INFO in log4j.properties) can exceed 
> 100MB in 30 seconds.
> AFAIK, the bug can be reproduced with 
> ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).
> I made a Docker container so that anyone interested can reproduce the 
> bug easily. (Sorry, there is no JUnit test right now.)
> {noformat}
> $ docker run -i -t --rm akihirosuda/zookeeper-bug01
> Reproducing the bug: infinite exception loop occurs when dataDir is lost
> * Resetting
> * Starting [1,2] with the initial ensemble [1]
> * Sleeping for 3 seconds
> * Invoking Reconfig [1]->[2]
> * Sleeping for 3 seconds
> * Killing server.2 (pid=10542)
> * Sleeping for 3 seconds
> * Resetting /zk02_data
> * Starting server.2
> * Sleeping for 30 seconds
> /zk01_log: 81665114 bytes
> The log dir is extremely large. Perhaps the bug was REPRODUCED!
> /zk02_log: 23949367 bytes
> The log dir is extremely large. Perhaps the bug was REPRODUCED!
> * Exiting
> {noformat}
> h2. Log
> h3. server.1
> {noformat}
> 2015-04-13 03:48:17,624 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1022]
>  - FOLLOWING
> 2015-04-13 03:48:17,624 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ZooKeeperServer@825]
>  - minSessionTimeout set to 4000
> 2015-04-13 03:48:17,624 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ZooKeeperServer@834]
>  - maxSessionTimeout set to 4
> 2015-04-13 03:48:17,624 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ZooKeeperServer@156]
>  - Created server with tickTime 2000 minSession
> Timeout 4000 maxSessionTimeout 4 datadir /zk01_data/version-2 snapdir 
> /zk01_data/version-2
> 2015-04-13 03:48:17,624 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@66]
>  - FOLLOWING - LEADER ELECTION TOOK - 0
> 2015-04-13 03:48:17,625 [myid:1] - WARN  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@93]
>  - Exception when following the leader
> java.io.IOException: Leaders epoch, 1 is less than accepted epoch, 2
> at 
> org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:331)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:75)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1024)
> 2015-04-13 03:48:17,626 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@119]
>  - Unregister MBean [org.apache.ZooKeeperService:
> name0=ReplicatedServer_id1,name1=replica.1,name2=Follower]
> 2015-04-13 03:48:17,626 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@198]
>  - shutdown called
> java.lang.Exception: shutdown Follower
> at 
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:198)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1028)
> 2015-04-13 03:48:17,626 [myid:1] - DEBUG 
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):LearnerZooKeeperServer@162]
>  - ZooKeeper server is not running, so n
> ot proceeding to shutdown!
> 2015-04-13 03:48:17,626 [myid:1] - WARN  
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1071]
>  - PeerState set to LOOKING
> 2015-04-13 03:48:17,626 [myid:1] - INFO  
> [Q

[jira] [Commented] (ZOOKEEPER-2347) Deadlock shutting down zookeeper

2016-01-12 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095073#comment-15095073
 ] 

Ted Yu commented on ZOOKEEPER-2347:
---

Assuming only test changes have been made since I performed validation last year, 
this should be good to go.

> Deadlock shutting down zookeeper
> 
>
> Key: ZOOKEEPER-2347
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2347
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.7
>Reporter: Ted Yu
>Assignee: Rakesh R
>Priority: Blocker
> Fix For: 3.4.8
>
> Attachments: ZOOKEEPER-2347-br-3.4.patch, 
> ZOOKEEPER-2347-br-3.4.patch, ZOOKEEPER-2347-br-3.4.patch, 
> ZOOKEEPER-2347-br-3.4.patch, testSplitLogManager.stack
>
>
> HBase recently upgraded to ZooKeeper 3.4.7.
> In one of the tests, TestSplitLogManager, there is a reproducible hang at the 
> end of the test.
> Below is a snippet from the stack trace related to ZooKeeper:
> {code}
> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f waiting on 
> condition [0x00011834b000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c5b8d3a0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main-SendThread(localhost:59510)" daemon prio=5 tid=0x7fd274eb4000 
> nid=0x9513 waiting on condition [0x000118042000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
>   at 
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for monitor 
> entry [0x0001170ac000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
>   - waiting to lock <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b waiting on 
> condition [0x000117a3]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c9b106b8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main" prio=5 tid=0x7fd27600 nid=0x1903 in Object.wait() 
> [0x000108aa1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1355)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:213)
>   at 
> org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:770)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:478)
>   - locked <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.shutdown(NIOServerCnxnFactory.java:266)
>   at 
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.shutdown(MiniZooKeeperCluster.java:301)
> {code}
> Note the address (0x0007c5b66400) in the last hunk which seems to 
> indicate some form of deadlock.
> According to Camille Fournier:
> We made shutdown synchronized. But decrementing the requests is
> also synchronized and called from a different thread. So yeah, deadlock.
> This came in
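
The lock pattern described in the quoted comment can be illustrated with a small, self-contained sketch. It is not ZooKeeper code: the class and method names below are hypothetical stand-ins, and a timed join is used so that the sketch terminates instead of hanging. It only shows that joining a thread while holding a monitor that the thread needs cannot make progress, and that the same join succeeds once the monitor is not held.

```java
// Hypothetical stand-in for the server object; not the real ZooKeeperServer.
final class ShutdownDeadlockSketch {
    private int inProcess = 1;

    // Stand-in for a synchronized method called from the worker thread.
    synchronized void decInProcess() {
        inProcess--;
    }

    // Reported pattern: wait for the worker while still holding the server monitor.
    // The worker is started here so that the monitor is already held when it runs.
    synchronized boolean shutdownHoldingMonitor(Thread worker, long timeoutMs)
            throws InterruptedException {
        worker.start();
        worker.join(timeoutMs); // timed join, so the sketch does not hang forever
        return !worker.isAlive();
    }

    // Alternative: wait for the worker without holding the server monitor.
    boolean shutdownWithoutMonitor(Thread worker, long timeoutMs)
            throws InterruptedException {
        worker.start();
        worker.join(timeoutMs);
        return !worker.isAlive();
    }

    /** Returns true if the worker finished before the join timed out. */
    static boolean workerFinished(boolean holdMonitor) {
        ShutdownDeadlockSketch server = new ShutdownDeadlockSketch();
        Thread worker = new Thread(server::decInProcess, "worker-sketch");
        worker.setDaemon(true);
        try {
            return holdMonitor
                    ? server.shutdownHoldingMonitor(worker, 200)
                    : server.shutdownWithoutMonitor(worker, 2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `ShutdownDeadlockSketch.workerFinished(true)` reproduces the blocked state from the stack trace in miniature (the worker is still alive when the join times out), while `workerFinished(false)` lets the worker complete. How the actual patch resolves the problem is not shown here.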

[jira] [Commented] (ZOOKEEPER-1962) Add a CLI command to recursively list a znode and children

2016-01-12 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094419#comment-15094419
 ] 

Chris Nauroth commented on ZOOKEEPER-1962:
--

+1 for Enis's proposal.  That logic agrees with GNU coreutils {{ls -R}} too, so 
it will be familiar to end users.  It still preserves the current behavior too, 
because you can think of the current behavior as the degenerate case of 
stopping before recursing.  I don't think there are any backwards-compatibility 
concerns with that approach.

> Add a CLI command to recursively list a znode and children
> --
>
> Key: ZOOKEEPER-1962
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1962
> Project: ZooKeeper
>  Issue Type: New Feature
>  Components: java client
>Affects Versions: 3.4.6
>Reporter: Gautam Gopalakrishnan
>Assignee: Gautam Gopalakrishnan
>Priority: Minor
> Fix For: 3.5.2, 3.6.0
>
> Attachments: ZOOKEEPER-1962.diff, ZOOKEEPER-1962_v2.patch, 
> ZOOKEEPER-1962_v3.patch, ZOOKEEPER-1962_v4.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When troubleshooting applications where znodes can be multiple levels deep 
> (e.g. HBase replication), it is handy to see all child znodes recursively 
> rather than running ls for each node manually.
> So I propose adding an option to the "ls" command (-r) which will list all 
> child nodes under a given znode. 
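
As a rough illustration of the recursive listing being discussed, here is a minimal sketch of a depth-first walk. It is not the attached patch. To stay self-contained it walks a hypothetical in-memory map of parent path to child names instead of a live server; with the Java client, the children would come from `ZooKeeper.getChildren(path, false)` instead. The znode names in `demoTree()` are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RecursiveLsSketch {
    /** Returns the given path followed by all of its descendants, parents first. */
    static List<String> listRecursive(Map<String, List<String>> tree, String path) {
        List<String> out = new ArrayList<>();
        visit(tree, path, out);
        return out;
    }

    private static void visit(Map<String, List<String>> tree, String path, List<String> out) {
        out.add(path);
        // A path with no entry in the map is treated as a leaf.
        for (String child : tree.getOrDefault(path, Collections.<String>emptyList())) {
            String childPath = path.equals("/") ? "/" + child : path + "/" + child;
            visit(tree, childPath, out);
        }
    }

    /** Hypothetical tree used only for illustration. */
    static Map<String, List<String>> demoTree() {
        Map<String, List<String>> tree = new HashMap<>();
        tree.put("/hbase", Arrays.asList("replication"));
        tree.put("/hbase/replication", Arrays.asList("peers", "rs"));
        return tree;
    }
}
```

Listing a path without recursing is the special case where the walk stops after the first level, which matches the point in the comment above that the current behavior is preserved.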



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2354) ZOOKEEPER-1653 not merged in master and 3.5 branch

2016-01-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094104#comment-15094104
 ] 

Hadoop QA commented on ZOOKEEPER-2354:
--

+1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12781838/ZOOKEEPER-2354-01.patch
  against trunk revision 1720227.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//console

This message is automatically generated.

> ZOOKEEPER-1653 not merged in master and 3.5 branch
> --
>
> Key: ZOOKEEPER-2354
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2354
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Arshad Mohammad
>Assignee: Arshad Mohammad
> Fix For: 3.5.2
>
> Attachments: ZOOKEEPER-2354-01.patch
>
>
> ZOOKEEPER-1653 is merged only to 3.4 branch. 
> It should be merged to 3.5 and master branch as well.





Success: ZOOKEEPER-2354 PreCommit Build #3006

2016-01-12 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2354
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 389490 lines...]
 [exec]   
http://issues.apache.org/jira/secure/attachment/12781838/ZOOKEEPER-2354-01.patch
 [exec]   against trunk revision 1720227.
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
(version 2.0.3) warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] +1 core tests.  The patch passed core unit tests.
 [exec] 
 [exec] +1 contrib tests.  The patch passed contrib unit tests.
 [exec] 
 [exec] Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//testReport/
 [exec] Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
 [exec] Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3006//console
 [exec] 
 [exec] This message is automatically generated.
 [exec] 
 [exec] 
 [exec] ==
 [exec] ==
 [exec] Adding comment to Jira.
 [exec] ==
 [exec] ==
 [exec] 
 [exec] 
 [exec] Comment added.
 [exec] 180cc089fbcf83048a2d7efa11e4e048353cd1d9 logged out
 [exec] 
 [exec] 
 [exec] ==
 [exec] ==
 [exec] Finished build.
 [exec] ==
 [exec] ==
 [exec] 
 [exec] 

BUILD SUCCESSFUL
Total time: 17 minutes 23 seconds
Archiving artifacts
Setting 
LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Recording test results
[description-setter] Description set: ZOOKEEPER-2354
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (ZOOKEEPER-1653) zookeeper fails to start because of inconsistent epoch

2016-01-12 Thread Arshad Mohammad (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094048#comment-15094048
 ] 

Arshad Mohammad commented on ZOOKEEPER-1653:


Created new jira ZOOKEEPER-2354 for merging this important fix to master and 
3.5 branch.

> zookeeper fails to start because of inconsistent epoch
> --
>
> Key: ZOOKEEPER-1653
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1653
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki
>Assignee: Michi Mutsuzaki
>Priority: Blocker
> Fix For: 3.4.6
>
> Attachments: ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.3.4.patch, 
> ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.patch, ZOOKEEPER-1653.patch
>
>
> It looks like QuorumPeer.loadDataBase() could fail if the server was 
> restarted after zk.takeSnapshot() but before finishing 
> self.setCurrentEpoch(newEpoch) in Learner.java.
> {code:java}
> case Leader.NEWLEADER: // it will be NEWLEADER in v1.0
>     zk.takeSnapshot();
>     self.setCurrentEpoch(newEpoch); // <<< got restarted here
>     snapshotTaken = true;
>     writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
>     break;
> {code}
> The server fails to start because currentEpoch is still 1 but the last 
> processed zxid from the snapshot has been updated.
> {noformat}
> 2013-02-20 13:45:02,733 5543 [pool-1-thread-1] ERROR 
> org.apache.zookeeper.server.quorum.QuorumPeer  - Unable to load database on 
> disk
> java.io.IOException: The current epoch, 1, is older than the last zxid, 
> 8589934592
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:439)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:413)
> ...
> {noformat}
> {noformat}
> $ find datadir 
> datadir
> datadir/version-2
> datadir/version-2/currentEpoch.tmp
> datadir/version-2/acceptedEpoch
> datadir/version-2/snapshot.0
> datadir/version-2/currentEpoch
> datadir/version-2/snapshot.2
> $ cat datadir/version-2/currentEpoch.tmp
> 2%
> $ cat datadir/version-2/acceptedEpoch
> 2%
> $ cat datadir/version-2/currentEpoch
> 1%
> {noformat}
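
The numbers in the error message above can be checked by hand. A zxid is a 64-bit value whose high 32 bits hold the epoch and whose low 32 bits hold a counter, so 8589934592 (2 * 2^32) carries epoch 2, which is newer than the currentEpoch of 1 found on disk. The sketch below works through that arithmetic; the class and method names are hypothetical, and the `consistent` check is only an approximation of the condition the error message describes, not a copy of QuorumPeer.loadDataBase().

```java
final class ZxidEpochSketch {
    /** High 32 bits of a zxid: the epoch. */
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    /** Low 32 bits of a zxid: the per-epoch counter. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Assumed startup condition: the stored epoch must not be older than the last zxid's epoch. */
    static boolean consistent(long currentEpoch, long lastZxid) {
        return epochOf(lastZxid) <= currentEpoch;
    }
}
```

With the values from the report, `consistent(1, 8589934592L)` is false, while the same snapshot with currentEpoch 2 (the value left in currentEpoch.tmp) would pass.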





[jira] [Updated] (ZOOKEEPER-2354) ZOOKEEPER-1653 not merged in master and 3.5 branch

2016-01-12 Thread Arshad Mohammad (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arshad Mohammad updated ZOOKEEPER-2354:
---
Attachment: ZOOKEEPER-2354-01.patch

Rebased the latest patch submitted by [~michim] for ZOOKEEPER-1653

> ZOOKEEPER-1653 not merged in master and 3.5 branch
> --
>
> Key: ZOOKEEPER-2354
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2354
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Arshad Mohammad
>Assignee: Arshad Mohammad
> Fix For: 3.5.2
>
> Attachments: ZOOKEEPER-2354-01.patch
>
>
> ZOOKEEPER-1653 is merged only to 3.4 branch. 
> It should be merged to 3.5 and master branch as well.





[jira] [Created] (ZOOKEEPER-2354) ZOOKEEPER-1653 not merged in master and 3.5 branch

2016-01-12 Thread Arshad Mohammad (JIRA)
Arshad Mohammad created ZOOKEEPER-2354:
--

 Summary: ZOOKEEPER-1653 not merged in master and 3.5 branch
 Key: ZOOKEEPER-2354
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2354
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Arshad Mohammad
Assignee: Arshad Mohammad
 Fix For: 3.5.2


ZOOKEEPER-1653 is merged only to 3.4 branch. 
It should be merged to 3.5 and master branch as well.





[jira] [Commented] (ZOOKEEPER-2353) QuorumCnxManager protocol needs to be upgradable with-in a specific Version

2016-01-12 Thread Powell Molleti (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094016#comment-15094016
 ] 

Powell Molleti commented on ZOOKEEPER-2353:
---

What would be the suggested approach to expanding the header? Is bumping the 
protocol version the only option?

> QuorumCnxManager protocol needs to be upgradable with-in a specific Version
> ---
>
> Key: ZOOKEEPER-2353
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2353
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.4.7, 3.5.1
>Reporter: Powell Molleti
>
> Currently 3.5.X sends its hdr as follows:
> {code:title=QuorumCnxManager.java|borderStyle=solid}
> dout.writeLong(PROTOCOL_VERSION);
> dout.writeLong(self.getId());
> String addr = self.getElectionAddress().getHostString() + ":" + 
> self.getElectionAddress().getPort();
> byte[] addr_bytes = addr.getBytes();
> dout.writeInt(addr_bytes.length);
> dout.write(addr_bytes);
> dout.flush();
> {code}
> Since it writes only the length of the host and port byte string, there is no simple way to 
> append new fields to this hdr anymore. I.e., the rx side has to treat all 
> bytes after the sid as the host and port string, which is what it does here:
> [QuorumCnxManager.InitialMessage.parse(): http://bit.ly/1Q0znpW]
> {code:title=QuorumCnxManager.java|borderStyle=solid}
> sid = din.readLong();
> int remaining = din.readInt();
> if (remaining <= 0 || remaining > maxBuffer) {
>     throw new InitialMessageException(
>             "Unreasonable buffer length: %s", remaining);
> }
> byte[] b = new byte[remaining];
> int num_read = din.read(b);
> if (num_read != remaining) {
>     throw new InitialMessageException(
>             "Read only %s bytes out of %s sent by server %s",
>             num_read, remaining, sid);
> }
> // FIXME: IPv6 is not supported. Using something like Guava's HostAndPort
> // parser would be good.
> String addr = new String(b);
> String[] host_port = addr.split(":");
> {code}
> This has been captured in the discussion here: ZOOKEEPER-2186.
> Though it is possible to circumvent this problem by various means, the request 
> here is to design the message hdr such that there is no need to bump the 
> version number or overload certain fields (i.e., guess whether a value is the length of 
> the host/port or the length of a different message, as in the above case).
> This is the idea, as captured in ZOOKEEPER-2186:
> {code:java}
> dout.writeLong(PROTOCOL_VERSION);
> String addr = self.getElectionAddress().getHostString() + ":" + 
> self.getElectionAddress().getPort();
> byte[] addr_bytes = addr.getBytes();
> // After version write the total length of msg sent by sender.
> dout.writeInt(Long.BYTES + addr_bytes.length);   
> // Write sid afterwards
> dout.writeLong(self.getId());
> // Write length of host/port string   
> dout.writeInt(addr_bytes.length);
> // Write host/port string   
> dout.write(addr_bytes); 
> {code}
> Since the total length of the message and the length of each variable field are both 
> present, it is quite easy to provide backward compatibility w.r.t. parsing 
> of the message. 
> Older code will read the part of the message it knows and ignore the rest. 
> Newer revisions that want to stay compatible will only append to the 
> hdr and not change the meaning of current fields.
> I am guessing this was the original intent w.r.t. the introduction of the protocol 
> version here: ZOOKEEPER-1633
> Since 3.4.x code does not parse this and 3.5.x is still in alpha, perhaps 
> it is possible to consider this change now?
> I would also like to propose carefully considering the option of using 
> protobufs for the next protocol version bump. This would help prevent issues like 
> this in the future.
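
To make the compatibility argument above concrete, here is a minimal round-trip sketch of such a length-prefixed hdr. It is an illustration under stated assumptions, not the proposed patch: the class name, the `MAX_BUFFER` limit, the `PROTOCOL_VERSION` value, and the `newerFields` parameter are all hypothetical. It also assumes that the total length counts every byte after the length field, including the 4-byte address length, which the quoted snippet does not count.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedHeaderSketch {
    static final long PROTOCOL_VERSION = -65536L; // placeholder value for this sketch
    static final int MAX_BUFFER = 2048;           // hypothetical sanity limit

    /** Encodes version, total length, sid, address length, address, then any newer fields. */
    static byte[] encode(long sid, String addr, byte[] newerFields) {
        byte[] addrBytes = addr.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dout = new DataOutputStream(bos)) {
            dout.writeLong(PROTOCOL_VERSION);
            // Total length counts every byte that follows this int.
            dout.writeInt(Long.BYTES + Integer.BYTES + addrBytes.length + newerFields.length);
            dout.writeLong(sid);
            dout.writeInt(addrBytes.length);
            dout.write(addrBytes);
            dout.write(newerFields);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Decodes the fields this reader knows, as "sid@host:port", and ignores trailing bytes. */
    static String decode(byte[] wire) {
        try (DataInputStream din = new DataInputStream(new ByteArrayInputStream(wire))) {
            long version = din.readLong();
            if (version != PROTOCOL_VERSION) {
                throw new IllegalArgumentException("Unknown protocol version: " + version);
            }
            int total = din.readInt();
            int fixed = Long.BYTES + Integer.BYTES;
            if (total < fixed || total > MAX_BUFFER) {
                throw new IllegalArgumentException("Unreasonable message length: " + total);
            }
            byte[] body = new byte[total];
            din.readFully(body); // throws EOFException if the message is shorter than declared
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(body));
            long sid = in.readLong();
            int addrLen = in.readInt();
            if (addrLen < 0 || addrLen > total - fixed) {
                throw new IllegalArgumentException("Unreasonable address length: " + addrLen);
            }
            byte[] addr = new byte[addrLen];
            in.readFully(addr);
            // Bytes after the address belong to fields added by newer senders; skip them.
            return sid + "@" + new String(addr, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A message encoded with extra trailing bytes decodes to the same sid and address as one without them, which is the "older code ignores the rest" behavior described above.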





[jira] [Commented] (ZOOKEEPER-2186) QuorumCnxManager#receiveConnection may crash with random input

2016-01-12 Thread Markus Aalto (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093703#comment-15093703
 ] 

Markus Aalto commented on ZOOKEEPER-2186:
-

We are using 3.4.6. The proposed patch looks as if it might work for us, although 
it would require changing OS keep-alive options.


> QuorumCnxManager#receiveConnection may crash with random input
> --
>
> Key: ZOOKEEPER-2186
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2186
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.6, 3.5.0
>Reporter: Raul Gutierrez Segales
>Assignee: Raul Gutierrez Segales
> Fix For: 3.4.7, 3.5.1, 3.6.0
>
> Attachments: ZOOKEEPER-2186-v3.4.patch, ZOOKEEPER-2186.patch, 
> ZOOKEEPER-2186.patch, ZOOKEEPER-2186.patch
>
>
> This will allocate an arbitrarily large byte buffer (and try to read it!):
> {code}
> public boolean receiveConnection(Socket sock) {
>     Long sid = null;
>     ...
>     sid = din.readLong();
>     // next comes the #bytes in the remainder of the message
>     int num_remaining_bytes = din.readInt();
>     byte[] b = new byte[num_remaining_bytes];
>     // remove the remainder of the message from din
>     int num_read = din.read(b);
> {code}
> This will crash the QuorumCnxManager thread, so the cluster will keep going 
> but future elections might fail to converge (ditto for leaving/joining 
> members). 
> Patch coming up in a bit.
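
The unbounded allocation described above can be avoided by validating the length before allocating. The following is a minimal sketch of that idea, not the attached patch: the class name and the `MAX_BUFFER` limit are hypothetical. It also uses `DataInputStream.readFully`, which either fills the whole buffer or throws `EOFException`, where the quoted code uses `read`, which may return fewer bytes than requested.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

final class BoundedReadSketch {
    static final int MAX_BUFFER = 2048; // hypothetical upper bound on the remainder

    /** Reads a 4-byte length and then exactly that many bytes, rejecting unreasonable lengths. */
    static byte[] readRemainder(DataInputStream din) throws IOException {
        int n = din.readInt();
        if (n <= 0 || n > MAX_BUFFER) {
            // Reject before allocating, so random input cannot request a huge buffer.
            throw new IOException("Unreasonable buffer length: " + n);
        }
        byte[] b = new byte[n];
        din.readFully(b); // throws EOFException if fewer than n bytes are available
        return b;
    }

    /** Returns true if the bytes form a well-formed length-prefixed remainder. */
    static boolean accepts(byte[] wire) {
        try (DataInputStream din = new DataInputStream(new ByteArrayInputStream(wire))) {
            readRemainder(din);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

With this check, an oversized or truncated message is reported as an error to the caller instead of bringing down the receiving thread with an allocation failure.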


