[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

Jared Cantwell (JIRA) Tue, 27 Jan 2015 16:45:12 -0800

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294503#comment-14294503
 ]


Jared Cantwell commented on ZOOKEEPER-1865:
-------------------------------------------

Camille, we didn't like the use of currentTimeMillis because it's not safe 
against time jumps, and we've had problems with that in the past, so I'm 
thinking of polishing up the patch I just attached, which uses System.nanoTime 
instead.  What do you think of that approach?

Do you have suggestions for some good tests that can leverage the nanoTime 
overridable method without further poking into the internals of 
connectToLeader?  Or were you thinking we should use it in already existing 
tests?
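
For illustration only, here is a minimal sketch of the idea, not the attached 
patch. The class BoundedRetry, the Attempt interface, and the pause() hook are 
hypothetical names; only System.nanoTime and the "never spend longer than 
initLimit" contract come from the discussion. Each attempt gets whatever is 
left of the overall budget as its timeout, and the clock is read through an 
overridable nanoTime() method so a test could substitute a fake clock.

```java
import java.util.concurrent.TimeUnit;

class BoundedRetry {
    /** Hypothetical stand-in for one sock.connect() call with a timeout. */
    interface Attempt {
        boolean tryOnce(int timeoutMs);
    }

    /** Overridable clock hook; a test can return a controlled value here. */
    protected long nanoTime() {
        return System.nanoTime();
    }

    /** Overridable pause between attempts, so tests need not really sleep. */
    protected void pause(long millis) throws InterruptedException {
        Thread.sleep(millis);
    }

    /**
     * Tries up to maxTries times, but never starts an attempt once budgetMs
     * has elapsed, and caps each attempt's timeout at the remaining budget.
     */
    boolean connectWithin(Attempt attempt, int maxTries, long budgetMs)
            throws InterruptedException {
        final long start = nanoTime();
        final long budgetNanos = TimeUnit.MILLISECONDS.toNanos(budgetMs);
        for (int i = 0; i < maxTries; i++) {
            long remaining = budgetNanos - (nanoTime() - start);
            if (remaining <= 0) {
                break; // budget exhausted, give up
            }
            long remainingMs = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remaining));
            int timeoutMs = (int) Math.min(Integer.MAX_VALUE, remainingMs);
            if (attempt.tryOnce(timeoutMs)) {
                return true;
            }
            // Re-check the budget before pausing so the pause cannot overrun it.
            remaining = budgetNanos - (nanoTime() - start);
            if (remaining <= 0) {
                break;
            }
            pause(Math.min(1000, Math.max(1, TimeUnit.NANOSECONDS.toMillis(remaining))));
        }
        return false;
    }
}
```

A test could subclass BoundedRetry, override nanoTime() and pause() to advance 
a fake clock, and then assert how many attempts are made before the budget 
runs out, without reaching into the connection internals.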

> Fix retry logic in Learner.connectToLeader() 
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-1865
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Thawan Kooburat
>            Assignee: Edward Carter
>             Fix For: 3.5.1
>
>         Attachments: ZOOKEEPER-1865-nanoTime-noUT.patch, ZOOKEEPER-1865.patch
>
>
> We discovered a long leader election time today in one of our prod ensembles.
> Here is the description of the event. 
> Before the old leader went down, it was able to announce a notification message. 
> So 3 out of 5 (including the old leader) elected the old leader to be the new 
> leader for the next epoch. While the old leader was being rebooted, 2 other 
> machines were trying to connect to the old leader.  So the quorum couldn't 
> form until those 2 machines gave up and moved to the next round of leader 
> election.
> This is because Learner.connectToLeader() uses simple retry logic. The 
> contract for this method is that it should never spend longer than initLimit 
> trying to connect to the leader.  In our outage, each sock.connect() was 
> probably blocked for initLimit, and it was called 5 times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
