[ https://issues.apache.org/jira/browse/ZOOKEEPER-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193369#comment-16193369 ]
Yicheng Fang commented on ZOOKEEPER-2899:
-----------------------------------------

[~maoling] Nah, I have since shifted focus to other tasks because it didn't seem like I was getting close to reproducing what we saw in prod. It would be great if you have some ideas we could try out. We're all ears.

> Zookeeper not receiving packets after ZXID overflows
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-2899
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2899
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>        Environment: 5-host ensemble, 1500+ client connections each, 300K+ nodes
>             OS: Ubuntu Precise
>             Java 7
>             Juniper QFX5100-48T NIC, 10000Mb/s, ixgbe driver
>             6-core Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>             4 HDDs, 600GB each
>            Reporter: Yicheng Fang
>        Attachments: GC_metric.png, image12.png, image13.png, message_in_per_sec.png, metric_volume.png, zk_20170309_wo_noise.log
>
>
> ZK was used with Kafka (version 0.10.0) for coordination. We had a lot of Kafka consumers writing consumption offsets to ZK.
> We observed the issue twice within the last year. Each time, after the ZXID overflowed, ZK stopped receiving packets even though leader election looked successful in the logs and the ZK servers were up. As a result, the whole Kafka system came to a halt.
> In an attempt to reproduce (and hopefully fix) the issue, I set up test ZK and Kafka clusters and fed them production-like test traffic. Though I was not able to reproduce the issue, I did see that the Kafka consumers, which used ZK clients, essentially DoSed the ensemble, filling up `submittedRequests` in `PrepRequestProcessor` and causing read latencies of 100ms or more.
> More details are included in the comments.
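For readers following the "ZXID overflow" part of the report: a ZXID is a 64-bit value in which the high 32 bits carry the leader epoch and the low 32 bits carry a per-epoch transaction counter, so exhausting the counter is what forces a new epoch (and a leader election). Below is a minimal, self-contained sketch of that bit layout, loosely modeled on the helpers in ZooKeeper's `ZxidUtils`; the class and method names here are illustrative, not the project's actual code.

```java
// Sketch of the ZXID layout: high 32 bits = leader epoch,
// low 32 bits = per-epoch transaction counter.
public final class ZxidSketch {

    static long getEpoch(long zxid) {
        return zxid >>> 32;
    }

    static long getCounter(long zxid) {
        return zxid & 0xffffffffL;
    }

    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        // The counter is exhausted after 2^32 - 1 transactions in one
        // epoch; at that point a new epoch (fresh counter) must be
        // established via leader election.
        long nearOverflow = makeZxid(5, 0xffffffffL);
        System.out.printf("zxid=0x%x epoch=%d counter=%d%n",
                nearOverflow, getEpoch(nearOverflow), getCounter(nearOverflow));

        // Naively incrementing past the counter's range bleeds into the
        // epoch bits -- the window this issue is concerned with.
        long overflowed = nearOverflow + 1;
        System.out.printf("zxid=0x%x epoch=%d counter=%d%n",
                overflowed, getEpoch(overflowed), getCounter(overflowed));
    }
}
```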
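And for context on the DoS observation: `PrepRequestProcessor` drains `submittedRequests` on a single thread, so producers submitting faster than that one consumer can process will grow the queue, and with it request latency. A toy model of that single-consumer, unbounded-queue shape (not ZooKeeper's actual code; names and timings are made up for illustration):

```java
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of one consumer thread fed by an unbounded queue, the same
// shape as PrepRequestProcessor's submittedRequests: when producers
// outpace the consumer, queue depth (and thus latency) keeps growing.
public class QueueBackupSketch {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Integer> submitted = new LinkedBlockingQueue<>();

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    submitted.take();
                    Thread.sleep(1); // ~1 ms of "processing" per request
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Burst of writes arriving much faster than they drain.
        for (int i = 0; i < 5000; i++) {
            submitted.put(i);
        }
        Thread.sleep(500);
        System.out.println("queue depth after 500 ms: " + submitted.size());
    }
}
```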