[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan-Philip Gehrcke updated ZOOKEEPER-3466:
------------------------------------------
    Description: 
Hey, we are exploring switching from ZooKeeper 3.4.14 to ZooKeeper 3.5.5 in 
[DC/OS|https://github.com/dcos/dcos].

DC/OS coordinates ZooKeeper via Exhibitor. We are not changing anything about 
Exhibitor for now and hope to use ZooKeeper 3.5.5 as a drop-in replacement for 
3.4.14. This seems to work fine when Exhibitor uses a so-called static ensemble, 
where the individual ZooKeeper instances are known a priori.

However, when Exhibitor discovers the individual ZooKeeper instances at runtime 
("dynamic" back-end), I think we observe a regression: ZooKeeper 3.5.5 can get 
into the following bad state (often, but not always):
 # three ZooKeeper instances find each other, leader election takes place 
(*expected*)
 # leader election succeeds: two followers, one leader (*expected*)
 # all three ZK instances respond with "imok" to RUOK (*expected*)
 # all three ZK instances respond to SRVR (one says "Mode: leader", the other 
two say "Mode: follower") (*expected*)
 # all three ZK instances respond to MNTR and show plausible output (*expected*; 
a small probe sketch for these four-letter commands follows after this list)
 # *{color:#ff0000}Unexpected:{color}* any ZooKeeper client trying to connect 
to any of the three nodes observes a "connection timeout". Notably, this is 
*not* a TCP connect() timeout: the TCP connect() succeeds, but ZK then does not 
seem to send any bytes on the connection, and the ZK clients wait for them via 
recv() until they hit a timeout condition. Examples for two different clients 
(a minimal Kazoo reproduction sketch follows after this list):
 ## In Kazoo we specifically hit _Connection time-out: socket time-out during 
read_, generated here: 
[https://github.com/python-zk/kazoo/blob/88b657a0977161f3815657878ba48f82a97a3846/kazoo/protocol/connection.py#L249]
 ## In zkCli we see 

{code:java}
Client session timed out, have not heard from server in 15003ms for sessionid 
0x0, closing socket connection and attempting reconnect 
(org.apache.zookeeper.ClientCnxn:main-SendThread(localhost:2181)){code}

 # This state is stable: it persists indefinitely (well, at least for multiple 
hours; we didn't test longer than that).
 # In our system the ZooKeeper clients crash-loop and retry. While they retry, 
the ZK ensemble accumulates outstanding requests, as shown in this MNTR output: 

{code:java}
zk_packets_received 2008
zk_packets_sent 127
zk_num_alive_connections 18
zk_outstanding_requests 1880{code}

 # The leader emits log lines confirming session timeouts, for example:

{code:java}
[myid:3] INFO [SessionTracker:ZooKeeperServer@398] - Expiring session 
0x2000642b18f0020, timeout of 10000ms exceeded
[myid:3] INFO [SessionTracker:QuorumZooKeeperServer@157] - Submitting global 
closeSession request for session 0x2000642b18f0020{code}
 # In this state, restarting either of the two ZK followers results in the same 
state (clients still don't get data from ZK upon connect).
 # In this state, restarting the ZK leader (and thereby triggering a leader 
re-election) almost immediately results in all clients being able to connect to 
all ZK instances successfully.
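
For reference, this is roughly how we probe the servers with the four-letter 
commands mentioned above; a minimal sketch, with placeholder host addresses for 
our three ensemble members (note that on 3.5.5 these commands may need to be 
whitelisted via the 4lw.commands.whitelist config option):

{code:python}
import socket


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a ZooKeeper four-letter-word command (ruok, srvr, mntr, ...)
    and return the raw text response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def mntr(host, port=2181):
    """Parse mntr output (tab-separated key/value lines) into a dict,
    e.g. to watch zk_outstanding_requests grow over time."""
    stats = {}
    for line in four_letter_word(host, port, "mntr").splitlines():
        key, _, value = line.partition("\t")
        if key:
            stats[key] = value.strip()
    return stats


if __name__ == "__main__":
    for host in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):  # placeholder addresses
        print(host, four_letter_word(host, 2181, "ruok").strip())  # expect imok
        print(host, mntr(host).get("zk_outstanding_requests"))
{code}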

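For completeness, here is a minimal Kazoo sketch of what our crash-looping 
clients effectively do (the connection string is a placeholder). In the bad 
state described above, the TCP connect succeeds but the session handshake never 
completes, so start() fails with the socket read timeout quoted in the list:

{code:python}
from kazoo.client import KazooClient
from kazoo.handlers.threading import KazooTimeoutError

# Placeholder connection string pointing at the three ensemble members.
HOSTS = "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181"

zk = KazooClient(hosts=HOSTS, timeout=10.0)
try:
    # start() blocks until a ZK session is established; in the bad state it
    # raises KazooTimeoutError because the server never answers the connect
    # request, even though the TCP connection itself was accepted.
    zk.start(timeout=15)
    print("connected, client state:", zk.state)
    print("children of /:", zk.get_children("/"))
except KazooTimeoutError as exc:
    print("could not establish a ZK session:", exc)
finally:
    zk.stop()
    zk.close()
{code}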

> ZK cluster converges, but does not properly handle client connections (new in 
> 3.5.5)
> ------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3466
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3466
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.5
>         Environment: Linux
>            Reporter: Jan-Philip Gehrcke
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
