Ben Sherman created ZOOKEEPER-2783:
--------------------------------------
Summary: follower disconnects and cannot reconnect
Key: ZOOKEEPER-2783
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2783
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection
Affects Versions: 3.4.10
Environment: centos 7, AWS EC2
Reporter: Ben Sherman
We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as well), and
sometimes, a node gets a read timeout, drops all the connections and tries to
re-establish itself to the quorum. It can usually do this in a few seconds,
but last night it took almost 15 minutes to reconnect.
These are 5 servers in AWS, and we've tried tuning the timeouts, but the are
exceeding any reasonable timeout and still failing.
In the attached logs, 5 is a follower, 3 is the leader. 5 loses connectivity
at 11:21:34. 3 sees the disconnect at the same moment.
5 tries to re-establish the quorum, but cannot do it until the connections to
the other servers expire at 11:37:02. After the connections are
re-established, 5 connects immediately.
At 11:41:08, the operator restarted the server, and it reconnected normally.
I suspect there is a problem with stale connections to the rest of the quorum -
the other services on this box were fine (monitoring, puppet) and able to
establish new connections with no problems.
I posed this problem to the zookeeper-users list and was asked to open a ticket.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)