[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

Hudson (JIRA) Sun, 15 Mar 2015 04:17:06 -0700

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362341#comment-14362341 ]


Hudson commented on ZOOKEEPER-1865:
-----------------------------------

FAILURE: Integrated in ZooKeeper-trunk #2629 (See [https://builds.apache.org/job/ZooKeeper-trunk/2629/])
ZOOKEEPER-1865 Fix retry logic in Learner.connectToLeader() (Edward Carter via michim) (michim: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1666784)
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/LearnerTest.java


> Fix retry logic in Learner.connectToLeader() 
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-1865
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Thawan Kooburat
>            Assignee: Edward Carter
>             Fix For: 3.5.1, 3.6.0
>
>         Attachments: ZOOKEEPER-1865-nanoTime.patch, 
> ZOOKEEPER-1865-testfix.patch, ZOOKEEPER-1865.patch
>
>
> We discovered a long leader election time today in one of our production 
> ensembles. Here is a description of the event.
> Before the old leader went down, it was able to send out its notification 
> message. So 3 out of 5 servers (including the old leader) elected the old 
> leader as the new leader for the next epoch. While the old leader was being 
> rebooted, the 2 other machines kept trying to connect to it, so the quorum 
> couldn't form until those 2 machines gave up and moved to the next round of 
> leader election.
> This is because Learner.connectToLeader() uses simple retry logic. The 
> contract for this method is that it should never spend longer than initLimit 
> trying to connect to the leader. In our outage, each sock.connect() call 
> probably blocked for initLimit, and it was called 5 times.
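
The contract described above (never spend longer than initLimit in total,
however many attempts are made) can be illustrated with a minimal sketch.
This is not the committed patch: the class BoundedConnect, its methods, and
the parameter budgetMillis (standing in for initLimit expressed in
milliseconds) are hypothetical names chosen for illustration. The idea shown
is to measure elapsed time with System.nanoTime() and give each
sock.connect() call only the time still left in the overall budget, instead
of the full budget on every attempt.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical illustration only; not the code from ZOOKEEPER-1865.patch.
class BoundedConnect {

    /** Milliseconds left in the budget, never negative. */
    static int remainingMillis(long startNanos, long nowNanos, int budgetMillis) {
        long elapsedMillis = (nowNanos - startNanos) / 1_000_000L;
        return (int) Math.max(0L, budgetMillis - elapsedMillis);
    }

    /**
     * Tries to connect up to maxTries times, but never spends more than
     * budgetMillis in total across all attempts and pauses.
     */
    static Socket connectWithDeadline(InetSocketAddress addr, int budgetMillis, int maxTries)
            throws IOException, InterruptedException {
        long start = System.nanoTime();
        IOException last = null;
        for (int attempt = 0; attempt < maxTries; attempt++) {
            int remaining = remainingMillis(start, System.nanoTime(), budgetMillis);
            if (remaining <= 0) {
                break; // budget used up; a timeout of 0 would mean "wait forever"
            }
            Socket sock = new Socket();
            try {
                // Each attempt gets only what is left of the overall budget.
                sock.connect(addr, remaining);
                return sock;
            } catch (IOException e) {
                last = e;
                sock.close();
            }
            // Short pause between attempts, also capped by the remaining budget.
            long pause = Math.min(1000L, remainingMillis(start, System.nanoTime(), budgetMillis));
            if (pause > 0) {
                Thread.sleep(pause);
            }
        }
        throw new IOException("could not connect within " + budgetMillis + " ms", last);
    }
}
```

With a simple retry loop, 5 attempts that each block for the full limit add
up to roughly 5 times the limit; with a shared deadline as sketched here, the
total is bounded by the limit itself.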



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
