[
https://issues.apache.org/jira/browse/ZOOKEEPER-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616914#comment-13616914
]
JL commented on ZOOKEEPER-1678:
-------------------------------
This probably means that the "WorkerSender" thread is blocked for that long in
connect() (in connectOne), and thus cannot contact enough peers (and get
responses) before
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader() times
out; that timeout is capped ({{maxNotificationInterval}}) at 60 secs.
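To make the cap concrete, here is a minimal sketch of a wait interval that
grows by doubling and is clamped at 60 secs. Only the 60 sec cap comes from the
comment above; the doubling rule, the 200 ms starting value, and the names
NotificationBackoff, nextTimeout and roundsUntilCap are assumptions made for
illustration, not a copy of the FastLeaderElection code.

```java
class NotificationBackoff {
    // Cap mentioned in the comment above (60 secs), in milliseconds.
    static final int MAX_NOTIFICATION_INTERVAL_MS = 60000;

    // Doubles the current wait, but never returns more than the cap.
    // (Doubling is an assumption of this sketch.)
    static int nextTimeout(int currentMs) {
        long doubled = 2L * currentMs;
        return (int) Math.min(doubled, (long) MAX_NOTIFICATION_INTERVAL_MS);
    }

    // Counts how many doublings it takes to reach the cap from initialMs.
    static int roundsUntilCap(int initialMs) {
        if (initialMs <= 0) {
            throw new IllegalArgumentException("initialMs must be positive");
        }
        int rounds = 0;
        int t = initialMs;
        while (t < MAX_NOTIFICATION_INTERVAL_MS) {
            t = nextTimeout(t);
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        int t = 200; // hypothetical starting wait, in milliseconds
        while (t < MAX_NOTIFICATION_INTERVAL_MS) {
            System.out.println("wait " + t + " ms");
            t = nextTimeout(t);
        }
        System.out.println("capped at " + t + " ms");
    }
}
```

Under these assumptions, a single connect() that blocks for longer than the
capped wait would be enough to keep a round from collecting responses in time.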
In theory, the connect should take no longer than tickTime * syncLimit, which
in this case are set to: {{tickTime=2000}}, {{initLimit=10}}
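As a rough illustration of that bound, the sketch below computes
tickTime * syncLimit and passes it as an explicit timeout to
java.net.Socket.connect(). The tickTime value is the one quoted above; the
syncLimit value of 5 is a hypothetical placeholder because this comment does
not give it, and ConnectBudget, connectTimeoutMs and connectWithTimeout are
illustrative names rather than ZooKeeper APIs.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ConnectBudget {
    // Upper bound on a connect attempt, in milliseconds.
    static int connectTimeoutMs(int tickTimeMs, int syncLimit) {
        return tickTimeMs * syncLimit;
    }

    // Opens a TCP connection, giving up after timeoutMs instead of
    // waiting for the operating system's default connect timeout.
    static Socket connectWithTimeout(String host, int port, int timeoutMs)
            throws IOException {
        Socket sock = new Socket();
        try {
            sock.connect(new InetSocketAddress(host, port), timeoutMs);
            return sock;
        } catch (IOException e) {
            // Covers SocketTimeoutException as well; release the socket.
            sock.close();
            throw e;
        }
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000; // from the configuration quoted above
        int syncLimit = 5;     // hypothetical value, not given in this comment
        System.out.println("connect budget: "
                + connectTimeoutMs(tickTimeMs, syncLimit) + " ms");
    }
}
```

A connect() call made without a timeout argument falls back to the platform's
TCP timeout, which can be far longer than this budget when the peer host is
down.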
> Server fails to join quorum when a peer is unreachable (5 ZK server setup)
> --------------------------------------------------------------------------
>
> Key: ZOOKEEPER-1678
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1678
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Environment: java version "1.6.0_32"
> Java(TM) SE Runtime Environment (build 1.6.0_32-b05)
> Java HotSpot(TM) 64-Bit Server VM (build 20.7-b02, mixed mode)
> Distributor ID: Ubuntu
> Description: Ubuntu 12.04.1 LTS
> Release: 12.04
> Codename: precise
> uname -a Linux ha-vani3-0 3.2.0-23-virtual #36-Ubuntu SMP Tue Apr 10 22:29:03
> UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: JL
>
> In a 5-node ZK cluster setup, in the following state:
> * 1 host is down / not reachable.
> * 4 hosts are up.
> * 3 ZK servers are in quorum.
> * a 4th ZK server was restarted and is trying to re-join the quorum.
> The 4th server is not able to rejoin the quorum because the connection to the
> unreachable host is not established, and apparently takes too long to time out.
> Stack traces and additional information coming.