[
https://issues.apache.org/jira/browse/ZOOKEEPER-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168260#comment-16168260
]
Yicheng Fang edited comment on ZOOKEEPER-2899 at 9/15/17 5:54 PM:
------------------------------------------------------------------
We tried running 'ZxidRolloverTest' with different setups but failed to
reproduce the issue, so we decided to use the same hardware as production. The
experiments below used a 5-node ZK ensemble, with
zookeeper.testingonly.initialZxid set to a high enough value:
1. Using small kazoo scripts, spawn client processes that each continuously
create randomly named ZK nodes and set data on them, generating the same number
of connections as production, while another set of clients randomly reads data
from the nodes (a sketch of such scripts is included after this list).
- Result: the ZXID overflowed and leader election completed within 5 seconds.
A short burst of errors was seen on the client side, but the clients recovered
right after the election.
2. Set up an 85-node Kafka broker cluster, then trigger the overflow with the
same method as in 1.
- Result: same as 1. The Kafka brokers behaved normally.
3. Set up a test tool to generate ~100k messages/s for the Kafka cluster, with
as many consumers as needed to reach the 1500-per-node connection count. The
consumers write consumption offsets to ZK every 10ms.
- We noticed that after the ZXID overflowed a couple of times, the whole
system began acting strangely: metrics from the brokers became sporadic, ISRs
became flappy, the metric volume sent by Kafka dropped, etc. See attachments
'message_in_per_sec.png', 'metric_volume.png', and 'GC_metric.png' for screenshots.
- From the 'srvr' stats, latency became '0/[>100]/[>200]', vs. '0/0/[<100]'
under normal conditions (a sketch of the 'srvr' polling snippet is also included
after this list). Profiling ZK revealed that this was because the ensemble
received write traffic at such high QPS (presumably from the Kafka consumers)
that the 'submittedRequests' queue in the leader's 'PrepRequestProcessor' filled
up, causing even reads to have high latencies.
- It looked to us as if electing a new leader on overflow somehow caused the
consumers to align, thus DDoSing the ensemble. However, we have not observed
the same behavior after bouncing the leader process BEFORE the overflow, even
though the ensemble should behave similarly in both cases since both trigger a
new leader election. One difference we noticed was that in the overflow case
the leader election port was left open, so the downed leader would participate
in the new round of leader election. Not sure if it's related, but we thought
it might be worth bringing up.
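For reference, below is a minimal sketch (not our exact scripts) of the kazoo
load generators from experiment 1. The host list, path prefix, and process
counts are placeholders; in practice they were tuned to match production
connection counts.

import random
import time
from multiprocessing import Process

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

HOSTS = "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181"  # placeholder 5-node ensemble
PREFIX = "/zxid-rollover-test"                          # placeholder root path
NUM_WRITERS = 50                                        # placeholder process counts
NUM_READERS = 50

def writer(worker_id):
    # Each create()/set() is one write transaction, so it consumes one ZXID.
    zk = KazooClient(hosts=HOSTS)
    zk.start()
    while True:
        path = "%s/node-%d-%d" % (PREFIX, worker_id, random.randint(0, 100000))
        data = str(time.time()).encode()
        if zk.exists(path):
            zk.set(path, data)
        else:
            try:
                zk.create(path, data, makepath=True)
            except NodeExistsError:
                pass

def reader(_):
    # Readers only issue exists()/get(), which are served by the connected
    # server and do not consume ZXIDs.
    zk = KazooClient(hosts=HOSTS)
    zk.start()
    while True:
        path = "%s/node-%d-%d" % (PREFIX,
                                  random.randint(0, NUM_WRITERS - 1),
                                  random.randint(0, 100000))
        if zk.exists(path):
            zk.get(path)

if __name__ == "__main__":
    procs = [Process(target=writer, args=(i,)) for i in range(NUM_WRITERS)]
    procs += [Process(target=reader, args=(i,)) for i in range(NUM_READERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Only the writers consume ZXIDs, which is why setting
zookeeper.testingonly.initialZxid close to the rollover point lets the overflow
be reached quickly.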
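Similarly, a sketch of the snippet used to poll the 'srvr' stats mentioned
above; host names are placeholders, and on newer ZK versions the four-letter
words may need to be whitelisted via 4lw.commands.whitelist:

import socket

def srvr(host, port=2181):
    """Send the 'srvr' four-letter word and return the raw response text."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(b"srvr")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

if __name__ == "__main__":
    for host in ["zk1", "zk2", "zk3", "zk4", "zk5"]:  # placeholder ensemble
        print("==== %s ====" % host)
        for line in srvr(host).splitlines():
            # 'Latency min/avg/max' is the source of the 0/[>100]/[>200] numbers above.
            if line.startswith(("Latency", "Zxid", "Outstanding", "Mode")):
                print(line)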
> Zookeeper not receiving packets after ZXID overflows
> ----------------------------------------------------
>
> Key: ZOOKEEPER-2899
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2899
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Environment: 5 host ensemble, 1500+ client connections each, 300K+
> nodes
> OS: Ubuntu precise
> JAVA 7
> JuniperQFX510048T NIC, 10000Mb/s, ixgbe driver
> 6 core Intel(R)_Xeon(R)_CPU_E5-2620_v3_@_2.40GHz
> 4 HDD 600G each
> Reporter: Yicheng Fang
> Attachments: GC_metric.png, image12.png, image13.png,
> message_in_per_sec.png, metric_volume.png, zk_20170309_wo_noise.log
>
>
> ZK was used with Kafka (version 0.10.0) for coordination. We had a lot of
> Kafka consumers writing consumption offsets to ZK.
> We observed the issue twice within the last year. Each time, after the ZXID
> overflowed, ZK stopped receiving packets even though leader election looked
> successful in the logs and the ZK servers were up. As a result, the whole
> Kafka system came to a halt.
> As an attempt to reproduce (and hopefully fix) the issue, I set up test ZK
> and Kafka clusters and fed them production-like test traffic. Though not
> really able to reproduce the issue, I did see that the Kafka consumers,
> which used ZK clients, essentially DOSed the ensemble, filling up the
> `submittedRequests` queue in `PrepRequestProcessor` and causing even 100ms+
> read latencies.
> More details are included in the comments.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)