[
https://issues.apache.org/jira/browse/ZOOKEEPER-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421324#comment-13421324
]
Flavio Junqueira commented on ZOOKEEPER-1515:
---------------------------------------------
Hi Ian, I'm OK with replacing 1000 with some function of tickTime. I'm not sure
about the if block, though. We already try to connect straight away in
connectToLeader and sleep only if the initial attempt is unsuccessful. If I
understand your proposal correctly, we would be trying twice with no sleep in
between.
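
To make the concern concrete, here is a minimal sketch of the retry loop as
described in this thread. It is a simplified model, not the actual
Learner.connectToLeader code: the class name ConnectRetrySketch, the simulate
helper, and the five-attempt limit are assumptions for illustration only. It
adds up how long a follower would sleep under each policy when the first
attempt is refused and the second one succeeds.

```java
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

// Simplified model of a connect-with-retry loop; NOT the real Learner code.
class ConnectRetrySketch {
    // Assumed attempt limit, chosen for illustration.
    static final int MAX_TRIES = 5;

    /**
     * Returns the total milliseconds that would be slept before the loop ends.
     * connects.test(tries) says whether attempt number `tries` succeeds;
     * sleepMsAfterFailure gives the sleep that follows a failed attempt.
     */
    static long simulate(IntPredicate connects, IntUnaryOperator sleepMsAfterFailure) {
        long sleptMs = 0;
        for (int tries = 0; tries < MAX_TRIES; tries++) {
            if (connects.test(tries)) {
                return sleptMs; // connected, stop retrying
            }
            if (tries < MAX_TRIES - 1) {
                // No sleep after the final failure; a real loop would give up here.
                sleptMs += sleepMsAfterFailure.applyAsInt(tries);
            }
        }
        return sleptMs;
    }

    public static void main(String[] args) {
        final int tickTime = 50; // ms, the value mentioned in the issue
        // First attempt is refused, second attempt succeeds.
        IntPredicate secondAttemptSucceeds = tries -> tries >= 1;

        // Fixed one-second sleep after every failed attempt.
        long fixedSleep = simulate(secondAttemptSucceeds, tries -> 1000);
        // Proposed: skip the sleep after the first failure, tickTime afterwards.
        long proposed = simulate(secondAttemptSucceeds, tries -> tries > 0 ? tickTime : 0);
        // Alternative: keep the sleep but scale it with tickTime.
        long tickOnly = simulate(secondAttemptSucceeds, tries -> tickTime);

        System.out.println("fixed 1000ms sleep: " + fixedSleep + " ms"); // 1000 ms
        System.out.println("proposed if block:  " + proposed + " ms");   // 0 ms
        System.out.println("tickTime sleep:     " + tickOnly + " ms");   // 50 ms
    }
}
```

In this model the proposed if block makes the second attempt with no pause at
all, which is the back-to-back retry mentioned above, while replacing 1000
with tickTime alone keeps one short sleep between the two attempts.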
> Long reconnect timeout if leader failed.
> ----------------------------------------
>
> Key: ZOOKEEPER-1515
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1515
> Project: ZooKeeper
> Issue Type: Improvement
> Components: leaderElection, quorum, server
> Affects Versions: 3.3.5
> Environment: Gentoo linux, but every environment is affected.
> Reporter: Ian Babrou
> Labels: patch, performance
>
> In ZooKeeper 3.3.5, in the file
> src/java/main/org/apache/zookeeper/server/quorum/Learner.java:325, you can see
> Thread.sleep(1000);
> This sleep always happens after a leader failure or restart. ZooKeeper re-elects
> a new leader and all followers try to connect to it, but the first attempt always
> fails with "Connection refused":
> {quote}
> 2012-07-23 18:55:48,159 - WARN [QuorumPeer:/0.0.0.0:2181:Learner@229] -
> Unexpected exception, tries=0, connecting to web329.local/192.168.1.74:2888
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> at java.net.Socket.connect(Socket.java:529)
> at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:221)
> at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:65)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)
> {quote}
> I propose replacing this line with the following code:
> {code:title=Learner.java|borderStyle=solid}
> if (tries > 0) {
>     Thread.sleep(self.tickTime);
> }
> {code}
> This way the first reconnect attempt is made immediately, and later attempts wait
> for one tick time (this is a good semantic change, I suppose).
> As a result of this change, leader re-election time dropped from >1500ms to
> 300-400ms with a 50ms tick time. This is pretty important for our production
> environment and will not break any existing installations.