[ https://issues.apache.org/jira/browse/ZOOKEEPER-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168208#comment-16168208 ]
Yicheng Fang edited comment on ZOOKEEPER-2899 at 9/15/17 5:16 PM:
------------------------------------------------------------------
ZXID overflowed in prod:
We observed that the ensemble was not receiving any packets during the
outage, as can be seen in the attachment 'image12.png', a Grafana graph whose
data source is the four-letter word commands. In the meantime, the node count
dropped by ~10,000 and stayed flat at 302,500 after the overflow. The
aggregated log is attached as 'zk_20170309_wo_noise.log'; it seems to indicate
that leader election finished successfully, a quorum was formed, and the ZK
servers started up.
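For reference, here is a minimal sketch of the ZXID layout involved: the helper names below are ours, but the split of the 64-bit zxid into a 32-bit epoch and a 32-bit counter matches what ZooKeeper's ZxidUtils computes. Once the low 32 bits are exhausted, the ensemble has to elect a new leader to bump the epoch, which is the leader election we see in the logs.
{noformat}
// Illustration only: high 32 bits = leader epoch, low 32 bits = per-epoch counter.
public class ZxidLayout {

    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xffffffffL; }

    public static void main(String[] args) {
        // Counter one step away from wrapping the low 32 bits.
        long nearOverflow = makeZxid(5, 0xfffffffeL);
        System.out.printf("epoch=%d counter=%d%n",
                epochOf(nearOverflow), counterOf(nearOverflow));

        // After the counter is exhausted, a new leader election bumps the
        // epoch and the counter starts again from 0.
        long afterElection = makeZxid(6, 0);
        System.out.printf("epoch=%d counter=%d%n",
                epochOf(afterElection), counterOf(afterElection));
    }
}
{noformat}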
However, we did see a lot of the following errors after the ZK servers went up:
{noformat}
2017-03-09 09:00:12,420 - ERROR [CommitProcessor:2:NIOServerCnxn@180] - Unexpected Exception:
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
    at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
2017-03-09 09:00:13,210 - ERROR [CommitProcessor:1:NIOServerCnxn@180] - Unexpected Exception:
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
    at org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1113)
    at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:120)
    at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:92)
    at org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:594)
    at org.apache.zookeeper.server.DataTree.killSession(DataTree.java:966)
    at org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:818)
    at org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:329)
    at org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:965)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:116)
    at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
{noformat}
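For context, this is the standard java.nio failure mode when a selection key has already been cancelled (typically because the client connection is gone) and the server then tries to update its interest set, as in the sendBuffer -> interestOps -> ensureValid frames above. A minimal, self-contained sketch (not ZooKeeper code) that reproduces the same exception:
{noformat}
import java.io.IOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class CancelledKeyDemo {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel channel = ServerSocketChannel.open();
        channel.configureBlocking(false);

        SelectionKey key = channel.register(selector, SelectionKey.OP_ACCEPT);
        key.cancel(); // roughly what closing/dropping the connection does to its key

        try {
            // Touching a cancelled key throws, just like the sendBuffer call above.
            key.interestOps(SelectionKey.OP_ACCEPT);
        } catch (CancelledKeyException e) {
            System.out.println("CancelledKeyException, as in the server log above");
        } finally {
            channel.close();
            selector.close();
        }
    }
}
{noformat}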
We mitigated the issue by restarting the ensemble, after which we saw traffic
flowing into the ensemble again and the whole system started recovering.
> Zookeeper not receiving packets after ZXID overflows
> ----------------------------------------------------
>
> Key: ZOOKEEPER-2899
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2899
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Environment: 5 host ensemble, 1500+ client connections each, 300K+
> nodes
> OS: Ubuntu precise
> JAVA 7
> JuniperQFX510048T NIC, 10000Mb/s, ixgbe driver
> 6 core Intel(R)_Xeon(R)_CPU_E5-2620_v3_@_2.40GHz
> 4 HDD 600G each
> Reporter: Yicheng Fang
> Attachments: image12.png, image13.png, zk_20170309_wo_noise.log
>
>
> ZK was used with Kafka (version 0.10.0) for coordination. We had a lot of
> Kafka consumers writing consumption offsets to ZK.
> We observed the issue two times within the last year. Each time after ZXID
> overflowed, ZK was not receiving packets even though leader election looked
> successful from the logs, and ZK servers were up. As a result, the whole
> Kafka system came to a halt.
> As an attempt to reproduce (and hopefully fix) the issue, I set up test ZK
> and Kafka clusters and fed them with production-like test traffic. Though I
> was not able to reproduce the issue, I did see that the Kafka consumers,
> which used ZK clients, essentially DoSed the ensemble, filling up the
> `submittedRequests` queue in `PrepRequestProcessor` and causing read
> latencies of 100ms or more (see the sketch after this quoted description).
> More details are included in the comments.
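The DoS pattern described in the quoted description is essentially a single consumer thread falling behind its input queue. Below is a minimal, self-contained sketch (not ZooKeeper code; the class name and timings are invented) of how queueing delay balloons once submissions outpace the lone processing thread, which is consistent with the 100ms+ read latencies observed once submittedRequests filled up.
{noformat}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of a single-threaded request pipeline fed by a blocking queue,
// in the spirit of PrepRequestProcessor's submittedRequests. All numbers are
// invented, purely to show how queueing delay grows once the arrival rate
// exceeds what the single consumer thread can drain.
public class QueueBacklogSketch {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Long> submitted = new LinkedBlockingQueue<>(); // enqueue timestamps
        AtomicLong worstWaitMs = new AtomicLong();

        Thread processor = new Thread(() -> {
            try {
                while (true) {
                    long enqueuedNanos = submitted.take();
                    long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedNanos);
                    worstWaitMs.accumulateAndGet(waitedMs, Math::max);
                    TimeUnit.MICROSECONDS.sleep(200); // pretend each request costs ~0.2 ms to prepare
                }
            } catch (InterruptedException ignored) {
            }
        });
        processor.setDaemon(true);
        processor.start();

        // Many "clients" submitting much faster than one thread can drain.
        for (int i = 0; i < 20_000; i++) {
            submitted.put(System.nanoTime());
        }
        TimeUnit.SECONDS.sleep(2);
        System.out.println("worst queueing delay so far: " + worstWaitMs.get()
                + " ms, still queued: " + submitted.size());
    }
}
{noformat}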