[
https://issues.apache.org/jira/browse/ZOOKEEPER-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490096#comment-15490096
]
Andrey commented on ZOOKEEPER-2386:
-----------------------------------
We are able to reproduce this issue on 3.4.6.
To reproduce it, the configuration must include an unreachable host;
"123.123.123.123:1234" should work.
> Cannot achieve quorum when middle server (in a q of 3) is unreachable
> --------------------------------------------------------------------
>
> Key: ZOOKEEPER-2386
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2386
> Project: ZooKeeper
> Issue Type: Bug
> Reporter: Enis Soztutar
> Attachments: zklogs.tar.gz
>
>
> Recently, we've observed a curious case where a quorum was not reached for
> days in a cluster of 3 nodes (zk0, zk1, zk2) while the middle node, zk1, was
> unreachable over the network.
> Leader election happens, and both zk0 and zk2 start voting. Each server
> then sends notifications to every other server, including itself. The
> problem is that the zk1 VM is unavailable, so when a server tries to open a
> socket to zk1 with a socket timeout of 5 seconds, the attempt delays the
> processing of the vote notification that zk2 sent to itself. The vote
> eventually arrives after 5 seconds, but by that time zk0 has already
> transitioned to the follower state. In the follower state, zk0 tries to
> connect to the leader 5 times with a 1-second timeout (5 seconds in total).
> Since the leader does not open its peer port until 5 seconds after the
> follower starts, the follower always times out connecting to the leader. This
> cycle repeats for hours or days, even after restarting the servers several
> times.
> I believe this is related to the default timeouts: the 5-second socket
> timeout and the follower-to-leader connection timeout (5 tries with a
> 1-second timeout each). Only after setting {{zookeeper.cnxTimeout}} to
> 1 second did the quorum start operating.
> More logs coming shortly.
> zoo.cfg:
> {code}
> server.3=zk2-hostname:2889:3889
> server.2=zk1-hostname:2889:3889
> server.1=zk0-hostname:2889:3889
> {code}
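
The timing race described in the issue above can be sketched as simple arithmetic. This is a rough model, not ZooKeeper code: the function names are hypothetical, and it assumes the only delay on the leader side is one blocked connection attempt to the unreachable peer, and that the follower starts retrying at time zero.

```python
def leader_ready_at(cnx_timeout_s: float) -> float:
    """Seconds until the would-be leader has processed its own vote.

    Assumption: the leader is held up by exactly one connection attempt
    to the unreachable peer, which blocks for the full socket timeout.
    """
    return cnx_timeout_s


def follower_gives_up_at(attempts: int, attempt_timeout_s: float) -> float:
    """Seconds until the follower has exhausted its connection attempts."""
    return attempts * attempt_timeout_s


def quorum_can_form(cnx_timeout_s: float,
                    attempts: int = 5,
                    attempt_timeout_s: float = 1.0) -> bool:
    """True if the leader is ready strictly before the follower gives up."""
    return leader_ready_at(cnx_timeout_s) < follower_gives_up_at(
        attempts, attempt_timeout_s)


if __name__ == "__main__":
    # Values taken from the issue description: 5 s socket timeout,
    # 5 follower attempts with a 1 s timeout each.
    print(quorum_can_form(5.0))  # False: leader is ready only as follower gives up
    # The workaround reported in the issue: socket timeout reduced to 1 s.
    print(quorum_can_form(1.0))  # True: leader is ready well within the retry window
```

Under these assumptions the two 5-second windows coincide, so the follower never sees the leader's port open; shortening the socket timeout breaks the tie.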