[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490096#comment-15490096
 ] 

Andrey commented on ZOOKEEPER-2386:
-----------------------------------

We are able to reproduce this issue on 3.4.6.

Steps to reproduce should include unreachable host in configuration. 
"123.123.123.123:1234" should be fine.

> Cannot achieve quorum when middle server (in a q of 3) is unreacable
> --------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2386
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2386
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>         Attachments: zklogs.tar.gz
>
>
> Recently, we've observed a curious case where a quorum was not reached for 
> days in a cluster of 3 nodes (zk0, zk1, zk2) and the middle node zk1 is 
> unreachable from network. 
> The leader election happens, and both zk0 and zk2 starts the vote. Then each 
> server sends notifications to every other server including itself. The 
> problem is that, zk1 vm is unavailable, so when we are trying to open up a 
> socket to connect to that server with socket timeout of 5 seconds, it delays 
> the notification processing of the vote sent from zk2 to zk2 (itself). The 
> vote eventually comes after 5 sec, but by that time, the follower (zk0) 
> already converted to the follower state. On the follower state, the follower 
> try to connect to leader 5 times with 1 second timeout (5 sec in total). 
> Since the leader does not start its peer port for 5 seconds after the 
> follower starts, the follower always times out connecting to the leader. This 
> cycle is repeating for hours / days even after restarting the servers several 
> times. 
> I believe this is related to the default timeouts (5 sec socket timeout) and 
> follower to leader connection timeout (5 tries with 1 second timeout). Only 
> after setting the {{zookeeper.cnxTimeout}} to 1 second, the quorum was 
> operating. 
> More logs coming shortly. 
> zoo.cfg: 
> {code}
> server.3=zk2-hostname:2889:3889
> server.2=zk1-hostname:2889:3889
> server.1=zk0-hostname:2889:3889
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to