[ https://issues.apache.org/jira/browse/ZOOKEEPER-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193369#comment-16193369 ]
Yicheng Fang commented on ZOOKEEPER-2899:
-----------------------------------------

[~maoling] Nah, I have since shifted focus to other tasks because it didn't seem like I was getting close to reproducing what we saw in prod. It would be great if you have some ideas we could try out. We're all ears.

> Zookeeper not receiving packets after ZXID overflows
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-2899
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2899
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>        Environment: 5-host ensemble, 1500+ client connections each, 300K+ nodes
>             OS: Ubuntu Precise
>             Java 7
>             Juniper QFX5100-48T NIC, 10000Mb/s, ixgbe driver
>             6-core Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>             4 HDDs, 600GB each
>            Reporter: Yicheng Fang
>        Attachments: GC_metric.png, image12.png, image13.png, message_in_per_sec.png, metric_volume.png, zk_20170309_wo_noise.log
>
>
> ZK was used with Kafka (version 0.10.0) for coordination. We had a lot of Kafka consumers writing consumption offsets to ZK.
> We observed the issue twice within the last year. Each time, after the ZXID overflowed, ZK stopped receiving packets even though leader election looked successful in the logs and the ZK servers were up. As a result, the whole Kafka system came to a halt.
> In an attempt to reproduce (and hopefully fix) the issue, I set up test ZK and Kafka clusters and fed them production-like test traffic. Though I was not able to reproduce the issue, I did see that the Kafka consumers, which used ZK clients, essentially DoSed the ensemble, filling up `submittedRequests` in `PrepRequestProcessor` and causing read latencies of 100ms or more.
> More details are included in the comments.
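For readers following the "ZXID overflow" part of the report: a ZXID is a 64-bit value in which the high 32 bits carry the leader epoch and the low 32 bits carry a per-epoch transaction counter, so exhausting the counter is what forces a new epoch (and a leader election). Below is a minimal, self-contained sketch of that bit layout, loosely modeled on the helpers in ZooKeeper's `ZxidUtils`; the class and method names here are illustrative, not the project's actual code.

```java
// Sketch of the ZXID layout: high 32 bits = leader epoch,
// low 32 bits = per-epoch transaction counter.
public final class ZxidSketch {

    static long getEpoch(long zxid) {
        return zxid >>> 32;
    }

    static long getCounter(long zxid) {
        return zxid & 0xffffffffL;
    }

    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        // The counter is exhausted after 2^32 - 1 transactions in one
        // epoch; at that point a new epoch (fresh counter) must be
        // established via leader election.
        long nearOverflow = makeZxid(5, 0xffffffffL);
        System.out.printf("zxid=0x%x epoch=%d counter=%d%n",
                nearOverflow, getEpoch(nearOverflow), getCounter(nearOverflow));

        // Naively incrementing past the counter's range bleeds into the
        // epoch bits -- the window this issue is concerned with.
        long overflowed = nearOverflow + 1;
        System.out.printf("zxid=0x%x epoch=%d counter=%d%n",
                overflowed, getEpoch(overflowed), getCounter(overflowed));
    }
}
```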
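And for context on the DoS observation: `PrepRequestProcessor` drains `submittedRequests` on a single thread, so producers submitting faster than that one consumer can process will grow the queue, and with it request latency. A toy model of that single-consumer, unbounded-queue shape (not ZooKeeper's actual code; names and timings are made up for illustration):

```java
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of one consumer thread fed by an unbounded queue, the same
// shape as PrepRequestProcessor's submittedRequests: when producers
// outpace the consumer, queue depth (and thus latency) keeps growing.
public class QueueBackupSketch {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Integer> submitted = new LinkedBlockingQueue<>();

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    submitted.take();
                    Thread.sleep(1); // ~1 ms of "processing" per request
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Burst of writes arriving much faster than they drain.
        for (int i = 0; i < 5000; i++) {
            submitted.put(i);
        }
        Thread.sleep(500);
        System.out.println("queue depth after 500 ms: " + submitted.size());
    }
}
```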