[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189102#comment-17189102
 ] 

maoling commented on ZOOKEEPER-3920:
------------------------------------

[~apriceq]

*1. Analysis*

*_1.1 reconfig disabled_*
The root cause of both ZOOKEEPER-3829 and ZOOKEEPER-3920 when reconfig is 
disabled is an inconsistent view of the QuorumVerifier.
In ZOOKEEPER-3829 it happens during a rolling start/upgrade of the cluster; in 
ZOOKEEPER-3920 it happens with the 0.0.0.0 config in the Docker environment.
Both share the same propagation chain:
{code:java}
reconfig disabled and leader election happens ---> QuorumVerifier is 
inconsistent ---> designatedLeader != self.getId() ---> allowedToCommit = 
false ---> cannot commit anything or create a session ---> clients hang and 
requests time out
{code}
ZOOKEEPER-3829 fixes it with this approach: when reconfig is disabled, skip the 
designatedLeader calculation and reconfig processing, and don't call 
setQuorumVerifier in the processReconfig method, so the server keeps using its 
local configuration to connect to the leader. Therefore ZOOKEEPER-3829 also 
fixes the ZOOKEEPER-3920 issue when reconfig is disabled.
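
For illustration only, a simplified sketch of that kind of guard (not the exact 
ZOOKEEPER-3829 patch; the real logic lives in QuorumPeer#processReconfig and 
the class and method below are abridged stand-ins):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical, abridged stand-in for the guard ZOOKEEPER-3829 adds to
// org.apache.zookeeper.server.quorum.QuorumPeer#processReconfig.
class ReconfigGuardSketch {
    private static final Logger LOG = LoggerFactory.getLogger(ReconfigGuardSketch.class);
    private final boolean reconfigEnabled;

    ReconfigGuardSketch(boolean reconfigEnabled) {
        this.reconfigEnabled = reconfigEnabled;
    }

    // Returns true only when a new (propagated) QuorumVerifier is actually installed.
    boolean processReconfig(Object propagatedQuorumVerifier) {
        if (!reconfigEnabled) {
            // Reconfig is disabled: skip installing the propagated QuorumVerifier, so
            // the server keeps its local static config and can still reach the leader.
            LOG.debug("Reconfig feature is disabled, skip reconfig processing.");
            return false;
        }
        // ... otherwise install the new verifier, recompute the designated leader,
        // and possibly restart leader election (omitted in this sketch).
        return true;
    }
}
{code}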


_*1.2 reconfig enabled*_
*_---> However, the same config has issues when using reconfiguration 
(reconfigEnabled=true). In that case killing the leader makes the cluster 
unavailable - no leader is able to be elected._*

When reconfig is enabled, all the servers need to reach consensus on the same 
QuorumVerifier, and the leader tells the other servers that its address is 
0.0.0.0, so the followers cannot connect to the leader and drop out of 
FOLLOWING. This is really not easy to resolve on the ZK side.

Overall, the _*server.X=Y*_ config pair list should be identical, consistent, 
and accessible on every server.
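
For illustration, assuming the container hostnames zoo1/zoo2/zoo3 used later in 
this comment, the dynamic config that a leader which advertised 0.0.0.0 (say 
server.3) would propagate looks roughly like this, leaving the followers with 
no reachable address for the leader:
{code:java}
# Illustrative dynamic config content as propagated by the leader (server.3),
# which advertised 0.0.0.0; hostnames zoo1/zoo2/zoo3 are assumed for this example.
server.1=zoo1:2888:3888:participant;2181
server.2=zoo2:2888:3888:participant;2181
server.3=0.0.0.0:2888:3888:participant;2181
{code}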


*2. K8S*
In the official K8S ZooKeeper guideline 
(https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/), these 
issues should never happen, because they use a config like this:
{code:java}
server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888
{code}
*3. Docker*
_*--->Thinking through the handling of 0.0.0.0 as a server address. I think 
many configurations running in containers opt to use this as the server address 
because it simplifies the binding. Otherwise containers may need to run in the 
host network mode to successfully bind the publicly accessible IP of the 
container (the docker host's).*_

The official Docker ZK image (https://hub.docker.com/_/zookeeper) uses the 
0.0.0.0 address, and another popular ZK image 
(https://hub.docker.com/r/bitnami/zookeeper) does the same and gives this 
explanation:
{code:java}
You have to use 0.0.0.0 as the host for the server. More concretely, if the ID 
of the zookeeper1 container starting is 1, then the ZOO_SERVERS environment 
variable has to be 0.0.0.0:2888:3888,zookeeper2:2888:3888,zookeeper3:2888:3888 
or if the ID of zookeeper servers are non-sequential then they need to be 
specified 0.0.0.0:2888:3888::2,zookeeper2:2888:3888::4.zookeeper3:2888:3888::6
{code}
I edited the *stack.yml* file to use something like *ZOO_SERVERS: 
server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 
server.3=zoo3:2888:3888;2181* instead, and I found that everything is OK and 
the cluster is in a healthy state.
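
For reference, the relevant part of that edited *stack.yml* looks roughly like 
this (illustrative only; just the zoo1 service is shown, and zoo2/zoo3 are 
analogous with ZOO_MY_ID 2 and 3):
{code:java}
# Illustrative excerpt of the Docker stack file; image tag and ports are assumed.
version: '3.1'
services:
  zoo1:
    image: zookeeper
    hostname: zoo1
    ports:
      - "2181:2181"
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=zoo3:2888:3888;2181
{code}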

I really don't understand what the distinct advantage of using 0.0.0.0 is, 
compared with the approach I mention above?

> Zookeeper clients timeout after leader change
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-3920
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3920
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.1
>            Reporter: Andre Price
>            Priority: Major
>         Attachments: stack.yml, zk_repro.zip
>
>
> [Sorry I believe this is a dupe of 
> https://issues.apache.org/jira/browse/ZOOKEEPER-3828 and potentially 
> https://issues.apache.org/jira/browse/ZOOKEEPER-3466 
> But I am not able to attach files there for some reason, so I am creating a new 
> issue which hopefully allows me]
> We are encountering an issue where failing over from the leader results in 
> zookeeper clients not being able to connect successfully. They timeout 
> waiting for a response from the server. We are attempting to upgrade some 
> existing zookeeper clusters from 3.4.14 to 3.6.1 (not sure if relevant but 
> stating in case it helps with pinpointing the issue) which is effectively blocked 
> by this issue. We perform the rolling upgrade (followers first then leader 
> last) and it seems to go successfully by all indicators. But we end up in the 
> state described in this issue where if the leader changes (either due to 
> restart or stopping) the cluster does not seem able to start new sessions.
> I've gathered some TRACE logs from our servers and will attach in the hopes 
> they can help figure this out. 
> Attached zk_repro.zip which contains the following:
>  * zoo.cfg used in one of the instances (they are all the same except for the 
> local server's ip being 0.0.0.0 in each)
>  * zoo.cfg.dynamic.next (don't think this is used anywhere but is written by 
> zookeeper at some point - I think when the first 3.6.1 container becomes 
> leader based on the value – the file is in all containers and is the same in 
> all as well)
>  * s\{1,2,3}_zk.log - logs from each of the 3 servers. Estimated time of 
> repro start indicated by "// REPRO START" text and whitespace in logs
>  * repro_steps.txt - rough steps executed that result in the server logs 
> attached
>  
> I'll summarize the repro here also:
>  # Initially it appears to be a healthy 3 node ensemble all running 3.6.1. 
> Server ids are 1,2,3 and 3 is the leader. Dynamic config/reconfiguration is 
> disabled.
>  # invoke srvr on each node (to verify setup and also create bookmark in logs)
>  # Do a zkCli get of /zookeeper/quota  which succeeds
>  # Do a restart of the leader (to same image/config) (server 2 now becomes 
> leader, 3 is back as follower)
>  # Try to perform the same zkCli get which times out (this get is done within 
> the container)
>  # Try to perform the same zkCli get but from another machine, this also 
> times out
>  # Invoke srvr on each node again (to verify that 2 is now the 
> leader/bookmark)
>  # Do a restart of server 2 (3 becomes leader, 2 follower)
>  # Do a zkCli get of /zookeeper/quota which succeeds
>  # Invoke srvr on each node again (to verify that 3 is leader)
> I tried to keep the other ZK traffic to a minimum but there are likely some 
> periodic mntr requests mixed from our metrics scraper.



