[ https://issues.apache.org/jira/browse/ZOOKEEPER-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186921#comment-17186921 ]

DaoThanhTung commented on ZOOKEEPER-3920:
-----------------------------------------

[~apriceq] could you please share the workaround config with me? I cannot start 
the ZooKeeper cluster without 0.0.0.0.
The workaround config I tried:
{code:java}
server.1=192.168.99.100:2888:3888;2181
server.2=192.168.99.101:2888:3888;2181
server.3=192.168.99.102:2888:3888;2181{code}
Result:
{code:java}
ERROR [zkId2/192.168.99.101:3888:QuorumCnxManager$Listener@958] - Exception 
while listening
java.net.BindException: Cannot assign requested address (Bind failed){code}
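
One variant I have not tried yet (just a sketch, and only my assumption that it 
helps here): keep the real IPs in every server line and set 
quorumListenOnAllIPs=true so that each server still binds its quorum and 
election ports on all local interfaces:
{code:java}
# untested sketch: real peer IPs in the server list, but bind the quorum (2888)
# and election (3888) listeners on all local interfaces instead of the listed IP
quorumListenOnAllIPs=true
server.1=192.168.99.100:2888:3888;2181
server.2=192.168.99.101:2888:3888;2181
server.3=192.168.99.102:2888:3888;2181{code}
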
My original config can start the cluster, but it hits this bug if I restart the 
leader:

{code:java}
server.1=192.168.99.100:2888:3888;2181
server.2=0.0.0.0:2888:3888;2181
server.3=192.168.99.102:2888:3888;2181{code}
 

> Zookeeper clients timeout after leader change
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-3920
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3920
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.1
>            Reporter: Andre Price
>            Priority: Major
>         Attachments: stack.yml, zk_repro.zip
>
>
> [Sorry, I believe this is a dupe of 
> https://issues.apache.org/jira/browse/ZOOKEEPER-3828 and potentially 
> https://issues.apache.org/jira/browse/ZOOKEEPER-3466, 
> but I am not able to attach files there for some reason, so I am creating a 
> new issue which hopefully allows me to.]
> We are encountering an issue where failing over from the leader results in 
> ZooKeeper clients not being able to connect successfully: they time out 
> waiting for a response from the server. We are attempting to upgrade some 
> existing ZooKeeper clusters from 3.4.14 to 3.6.1 (not sure if that is 
> relevant, but stating it in case it helps pinpoint the issue), which is 
> effectively blocked by this problem. We perform the rolling upgrade 
> (followers first, then the leader last) and it seems to go successfully by 
> all indicators, but we end up in the state described in this issue: if the 
> leader changes (either due to a restart or a stop), the cluster does not seem 
> able to start new sessions.
> I've gathered some TRACE logs from our servers and will attach them in the 
> hope they help figure this out. 
> Attached zk_repro.zip, which contains the following:
>  * zoo.cfg used in one of the instances (they are all the same except that 
> the local server's IP is 0.0.0.0 in each; a rough sketch of the pattern 
> follows this list)
>  * zoo.cfg.dynamic.next (I don't think this is used anywhere, but it is 
> written by ZooKeeper at some point - I think when the first 3.6.1 container 
> becomes leader, based on the value. The file is present in all containers and 
> is the same in all of them)
>  * s\{1,2,3}_zk.log - logs from each of the 3 servers. The estimated start of 
> the repro is indicated by "// REPRO START" text and whitespace in the logs
>  * repro_steps.txt - rough steps executed that produce the attached server 
> logs
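> A rough sketch of the per-instance zoo.cfg pattern from the first bullet 
> (hypothetical values and hostnames; shown as server 3's copy, where only its 
> own line uses 0.0.0.0):
> {code:java}
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/data
> reconfigEnabled=false
> 4lw.commands.whitelist=srvr,mntr
> server.1=zk1:2888:3888;2181
> server.2=zk2:2888:3888;2181
> server.3=0.0.0.0:2888:3888;2181{code}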
>  
> I'll summarize the repro here as well (a rough command sketch follows at the 
> end):
>  # Initially it appears to be a healthy 3-node ensemble, all running 3.6.1. 
> Server ids are 1, 2, 3 and 3 is the leader. Dynamic config/reconfiguration is 
> disabled.
>  # Invoke srvr on each node (to verify the setup and also create a bookmark 
> in the logs)
>  # Do a zkCli get of /zookeeper/quota, which succeeds
>  # Restart the leader (to the same image/config); server 2 now becomes the 
> leader, 3 is back as a follower
>  # Try to perform the same zkCli get, which times out (this get is done 
> within the container)
>  # Try to perform the same zkCli get from another machine; this also times 
> out
>  # Invoke srvr on each node again (to verify that 2 is now the 
> leader/bookmark)
>  # Restart server 2 (3 becomes leader, 2 follower)
>  # Do a zkCli get of /zookeeper/quota, which succeeds
>  # Invoke srvr on each node again (to verify that 3 is the leader)
> I tried to keep the other ZK traffic to a minimum, but there are likely some 
> periodic mntr requests mixed in from our metrics scraper.
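> 
> A rough, untested sketch of how the core steps might be scripted (hostnames 
> and container name are hypothetical; assumes srvr is whitelisted via 
> 4lw.commands.whitelist):
> {code:bash}
> # steps 2/7/10: ask each node for its role to find the current leader
> for h in zk1 zk2 zk3; do echo srvr | nc "$h" 2181; done
> # step 3: read a znode that always exists
> bin/zkCli.sh -server zk1:2181 get /zookeeper/quota
> # step 4: restart the current leader (container name is hypothetical)
> docker restart zk3
> # steps 5/6: repeat the read; after the leader change it times out
> bin/zkCli.sh -server zk1:2181 get /zookeeper/quota{code}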



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
