ko christ created ZOOKEEPER-3871:
------------------------------------

             Summary: Dockerized Zookeeper clients fail on Zookeeper leader 
changes
                 Key: ZOOKEEPER-3871
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3871
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.5.8, 3.6.1, 3.5.5
            Reporter: ko christ


h2. Description

In a nutshell, my dockerized Zookeeper installation stops working on cluster 
leader changes.

The cluster still responds to 4-letter commands, but when I force a leader change, clients time out indefinitely. As a workaround, follow-up restarts of the leader eventually resolve the issue, usually once leadership returns to the previous node. This defeats the high availability of the cluster.
h2. Example

For example, assume a 3-node ZK cluster with the following initial state 
(*State A*). All Zookeeper clients work fine in this state.
||ZK 1||ZK 2||ZK 3||
|follower|follower|*leader*|

 

and a restart occurs, after which Zookeeper ends up in this state (*State B*)
||ZK 1||ZK 2||ZK 3||
|follower|*leader*|follower|

In State B, all client connection attempts fail and time out indefinitely. 
Follow-up leader restarts may resolve the issue, usually (but not always) 
because the cluster *returns to the previous State A*. 
h2. Affected versions

I have verified this bug using
 * *{{3.5.5}}*
 * *{{3.5.8}}*
 * *{{3.6.1}}*

h2. Reproduce

{color:#de350b}Note: In all the examples below, replace tortoise with your 
hostname.{color}

Deploy a 3-node Zookeeper cluster (a 5-node cluster works as well) using the 
official 3.5.8 image.
{code:bash}
docker run -d --name=zkcl01 -p 1493:1493 -p 1494:1494 -p 1495:1495 -h 
tortoise-zkcl01 -e HOSTNAME=tortoise -e ZOO_PORT=1493 -e 
ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" -e 
ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False 
-e ZOO_SERVERS="server.1=0.0.0.0:1495:1494;1493 
server.2=tortoise:1498:1497;1496 server.3=tortoise:1501:1500;1499" -e 
ZOO_MY_ID=1 zookeeper:3.5.8
docker run -d --name=zkcl02 -p 1496:1496 -p 1497:1497 -p 1498:1498 -h 
tortoise-zkcl02 -e HOSTNAME=tortoise -e ZOO_PORT=1496 -e 
ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" -e 
ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False 
-e ZOO_SERVERS="server.1=tortoise:1495:1494;1493 
server.2=0.0.0.0:1498:1497;1496 server.3=tortoise:1501:1500;1499" -e 
ZOO_MY_ID=2 zookeeper:3.5.8
docker run -d --name=zkcl03 -p 1499:1499 -p 1500:1500 -p 1501:1501 -h 
tortoise-zkcl03 -e HOSTNAME=tortoise -e ZOO_PORT=1499 -e 
ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" -e 
ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False 
-e ZOO_SERVERS="server.1=tortoise:1495:1494;1493 
server.2=tortoise:1498:1497;1496 server.3=0.0.0.0:1501:1500;1499" -e 
ZOO_MY_ID=3 zookeeper:3.5.8
{code}
 

Monitor the cluster's state with the 4-letter {{srvr}} command
{code:bash}
watch -n 1 'for i in 1493 1496 1499; do echo $i; echo srvr | nc tortoise $i ; 
echo; done'{code}
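The {{srvr}} output can be reduced to just each node's role with a small helper (a sketch; it only assumes the standard {{Mode: leader}} / {{Mode: follower}} line that {{srvr}} prints):
{code:bash}
# Print a node's role (leader/follower) given srvr output on stdin.
# Assumes the standard "Mode: <role>" line that srvr prints.
zk_mode() {
  grep '^Mode:' | sed 's/^Mode: //'
}

# Example against a live node:
#   echo srvr | nc tortoise 1493 | zk_mode
{code}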
 

Verify that you can connect to the cluster successfully using any client 
({{zkCli.sh}} in this case)
{code:bash}
docker exec -ti zkcl01 bin/zkCli.sh -server 
tortoise:1493,tortoise:1496,tortoise:1499 ls /
...
...
WatchedEvent state:SyncConnected type:None path:null
[zookeeper]{code}
 

Stop/start the leader node (identified via the {{srvr}} output from the 
previous step) to force a leader change.
{code:bash}
docker stop zkcl03; sleep 15; docker start zkcl03{code}
 

Verify that the client now fails to connect and times out.
{code:bash}
docker exec -ti zkcl01 bin/zkCli.sh -server 
tortoise:1493,tortoise:1496,tortoise:1499 ls /
...
...
closing socket connection and attempting reconnect
KeeperErrorCode = ConnectionLoss for /{code}
 

Finally, -restart- stop/sleep/start the leader a few more times and observe 
that the client usually succeeds only once the leader returns to the initial 
state.
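For reference, the stop/sleep/start cycle can be scripted. This is only a sketch; it assumes the container names and client ports from the steps above ({{zkcl01}}..{{zkcl03}} on 1493/1496/1499):
{code:bash}
# Map a client port to its container name (the mapping used in the steps above).
container_for_port() {
  case "$1" in
    1493) echo zkcl01 ;;
    1496) echo zkcl02 ;;
    1499) echo zkcl03 ;;
    *) return 1 ;;
  esac
}

# Bounce whichever node is currently the leader:
#   for port in 1493 1496 1499; do
#     if echo srvr | nc tortoise "$port" | grep -q '^Mode: leader'; then
#       c=$(container_for_port "$port")
#       docker stop "$c"; sleep 15; docker start "$c"
#     fi
#   done
{code}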

 

This must be a bug unless there is a misconfiguration that I am missing.


