[
https://issues.apache.org/jira/browse/ZOOKEEPER-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
maoling updated ZOOKEEPER-3920:
-------------------------------
Fix Version/s: 3.6.2
> Zookeeper clients timeout after leader change
> ---------------------------------------------
>
> Key: ZOOKEEPER-3920
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3920
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.1
> Reporter: Andre Price
> Priority: Major
> Fix For: 3.6.2
>
> Attachments: stack.yml, zk_repro.zip
>
>
> [Sorry, I believe this is a duplicate of
> https://issues.apache.org/jira/browse/ZOOKEEPER-3828 and potentially
> https://issues.apache.org/jira/browse/ZOOKEEPER-3466,
> but I am not able to attach files to those issues for some reason, so I am
> creating a new issue that hopefully allows me to.]
> We are encountering an issue where failing over from the leader results in
> ZooKeeper clients being unable to connect successfully; they time out
> waiting for a response from the server. We are attempting to upgrade some
> existing ZooKeeper clusters from 3.4.14 to 3.6.1 (not sure if that is
> relevant, but stating it in case it helps with pinpointing the issue), and
> the upgrade is effectively blocked by this problem. We perform the rolling
> upgrade (followers first, then the leader last) and it appears to succeed by
> all indicators, but we end up in the state described in this issue: if the
> leader changes (either due to a restart or a stop), the cluster does not
> seem able to start new sessions.
> I've gathered TRACE logs from our servers and will attach them in the hope
> that they help figure this out.
> Attached zk_repro.zip, which contains the following:
> * zoo.cfg used in one of the instances (they are all the same except that
> each server lists its own address as 0.0.0.0; a sketch of the layout is
> shown after this list)
> * zoo.cfg.dynamic.next (I don't think this is used anywhere, but it is
> written by ZooKeeper at some point - judging by its contents, I think when
> the first 3.6.1 container becomes leader; the file is present in all
> containers and is identical in all of them)
> * s{1,2,3}_zk.log - logs from each of the 3 servers. The estimated start of
> the repro is indicated by the "// REPRO START" text and whitespace in the
> logs
> * repro_steps.txt - rough steps executed that result in the attached server
> logs
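> For reference, here is a minimal sketch of how such a static config is
> typically laid out; the paths, peer hostnames (zk2, zk3), ports, and timing
> values below are placeholders for illustration, not the contents of the
> attached file:
> {code}
> # zoo.cfg as seen on server.1 (hypothetical values)
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/data/zookeeper
> clientPort=2181
> # the local server's own entry uses 0.0.0.0, the peers use real addresses
> server.1=0.0.0.0:2888:3888
> server.2=zk2:2888:3888
> server.3=zk3:2888:3888
> {code}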
>
> I'll summarize the repro here as well (the commands used are sketched after
> this list):
> # Initially it appears to be a healthy 3-node ensemble, all running 3.6.1.
> Server IDs are 1, 2, 3 and 3 is the leader. Dynamic config/reconfiguration
> is disabled.
> # Invoke srvr on each node (to verify the setup and also create a bookmark
> in the logs).
> # Do a zkCli get of /zookeeper/quota, which succeeds.
> # Restart the leader (to the same image/config); server 2 now becomes the
> leader and 3 comes back as a follower.
> # Try to perform the same zkCli get, which times out (this get is done
> within the container).
> # Try to perform the same zkCli get from another machine; this also times
> out.
> # Invoke srvr on each node again (to verify that 2 is now the leader, and as
> a log bookmark).
> # Restart server 2 (3 becomes the leader, 2 a follower).
> # Do a zkCli get of /zookeeper/quota, which succeeds.
> # Invoke srvr on each node again (to verify that 3 is the leader).
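> A rough sketch of the commands behind the steps above; the hostnames
> (zk1/zk2/zk3) and port 2181 are placeholders for our actual addresses:
> {code}
> # check the role of each server / create a log bookmark (four-letter word "srvr")
> for h in zk1 zk2 zk3; do echo srvr | nc $h 2181; done
>
> # read an existing znode; this succeeds before the leader restart and
> # times out afterwards, until the leader changes again
> bin/zkCli.sh -server zk1:2181 get /zookeeper/quota
> {code}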
> I tried to keep other ZK traffic to a minimum, but there are likely some
> periodic mntr requests mixed in from our metrics scraper.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)