Adam Kafka created KAFKA-7349:
---------------------------------
Summary: Long Disk Writes cause Zookeeper Disconnects
Key: KAFKA-7349
URL: https://issues.apache.org/jira/browse/KAFKA-7349
Project: Kafka
Issue Type: Bug
Components: zkclient
Affects Versions: 0.11.0.1
Reporter: Adam Kafka
Attachments: SpikeInWriteTime.png
We run our Kafka cluster on a cloud service provider. As a consequence, we see
large tail latencies on disk writes that are outside our control: some writes
take on the order of seconds. We have noticed that these long write times are
often correlated with subsequent ZooKeeper disconnects on the brokers. It
appears that during the long write, the ZooKeeper heartbeat thread does not get
scheduled CPU time, resulting in a long gap between heartbeats. After the
write, the ZK thread does get scheduled again, but it detects that it has not
heard from ZooKeeper in a while, so it drops its connection and then rejoins
the cluster.
Note that the reported timeout is inconsistent with the configured timeout
({{zookeeper.session.timeout.ms}}, default 6 seconds). We have seen a range of
reported values, including 5950ms (below the threshold), 12032ms (double the
threshold), and 25999ms (far above the threshold).
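The scattered values are consistent with the gap only being measured when the send thread is next scheduled. A minimal sketch of that measurement (plain Java, not the actual ZooKeeper client code; the timestamps are hypothetical, taken from the 12032ms event below):

```java
public class HeartbeatGap {
    public static void main(String[] args) {
        // Configured limit: zookeeper.session.timeout.ms, default 6 seconds.
        long sessionTimeoutMs = 6000;

        // Hypothetical timeline: last successful ping at t=0, then the
        // heartbeat thread gets no CPU until t=12032 ms.
        long lastPingMs = 0;
        long threadScheduledAgainMs = 12032;

        // The client can only observe the gap once it runs again, so the
        // reported value is however long the thread was starved -- it can land
        // below, near, or far above the configured timeout.
        long observedGapMs = threadScheduledAgainMs - lastPingMs;
        boolean sessionExpired = observedGapMs > sessionTimeoutMs;

        System.out.println("have not heard from server in " + observedGapMs
                + "ms (configured timeout " + sessionTimeoutMs + "ms, expired="
                + sessionExpired + ")");
    }
}
```

This would explain why the logged value is not clamped to the configured timeout: it is simply elapsed wall-clock time at the moment the thread wakes up.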
We noticed that during a service degradation of our cloud provider's storage
service, these ZooKeeper disconnects increased drastically in frequency. We are
hoping there is a way to decouple these components. Do you agree with our
diagnosis that the ZK disconnects occur due to thread contention caused by long
disk writes? Could the ZK thread be scheduled at a higher priority? Do you have
any suggestions for how to avoid the disconnects?
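On the priority question: Java does expose a priority hint via {{Thread.setPriority}}, though the JVM and OS are free to ignore it, and as far as we can tell Kafka does not currently expose such a setting for the ZooKeeper client threads. A minimal sketch of the mechanism we are asking about (the thread name and runnable are made up for the demo):

```java
public class PriorityHintDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a heartbeat sender; the real thread is created inside
        // org.apache.zookeeper.ClientCnxn, not user code.
        Thread heartbeat = new Thread(
                () -> System.out.println("running at priority "
                        + Thread.currentThread().getPriority()),
                "zk-heartbeat-demo");

        // MAX_PRIORITY (10) is only a hint to the scheduler; many JVM/OS
        // combinations largely ignore it, so this alone may not prevent the
        // starvation described above.
        heartbeat.setPriority(Thread.MAX_PRIORITY);
        heartbeat.start();
        heartbeat.join();
    }
}
```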
Here is an example of one of these events:
Logs on the Broker:
{code}
[2018-08-25 04:10:19,695] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:21,697] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:23,700] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:25,701] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:27,702] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:29,704] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:31,707] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:33,709] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:35,712] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:37,714] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:39,716] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:41,719] DEBUG Got ping response for sessionid:
0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
...
[2018-08-25 04:10:53,752] WARN Client session timed out, have not heard from
server in 12032ms for sessionid 0x36202ab4337002c
(org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:53,754] INFO Client session timed out, have not heard from
server in 12032ms for sessionid 0x36202ab4337002c, closing socket connection
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:53,920] INFO zookeeper state changed (Disconnected)
(org.I0Itec.zkclient.ZkClient)
[2018-08-25 04:10:53,920] INFO Waiting for keeper state SyncConnected
(org.I0Itec.zkclient.ZkClient)
...
{code}
GC logs during the same time (demonstrating this is not just a long GC):
{code}
2018-08-25T04:10:36.434+0000: 35150.779: [GC (Allocation Failure)
3074119K->2529089K(6223360K), 0.0137342 secs]
2018-08-25T04:10:37.367+0000: 35151.713: [GC (Allocation Failure)
3074433K->2528524K(6223360K), 0.0127938 secs]
2018-08-25T04:10:38.274+0000: 35152.620: [GC (Allocation Failure)
3073868K->2528357K(6223360K), 0.0131040 secs]
2018-08-25T04:10:39.220+0000: 35153.566: [GC (Allocation Failure)
3073701K->2528885K(6223360K), 0.0133247 secs]
2018-08-25T04:10:40.175+0000: 35154.520: [GC (Allocation Failure)
3074229K->2528639K(6223360K), 0.0127870 secs]
2018-08-25T04:10:41.084+0000: 35155.429: [GC (Allocation Failure)
3073983K->2530769K(6223360K), 0.0135058 secs]
2018-08-25T04:10:42.012+0000: 35156.358: [GC (Allocation Failure)
3076113K->2531772K(6223360K), 0.0165919 secs]
2018-08-25T04:10:53.737+0000: 35168.083: [GC (Allocation Failure)
3077116K->2529630K(6223360K), 0.0135440 secs]
2018-08-25T04:10:58.515+0000: 35172.860: [GC (Allocation Failure)
3074974K->2531144K(6223360K), 0.0164601 secs]
2018-08-25T04:11:03.149+0000: 35177.495: [GC (Allocation Failure)
3076488K->2531751K(6223360K), 0.0146978 secs]
{code}
The attached screenshot shows a spike in write time to our data volume in the
same second. Note that this is an average value, so it does not represent the
maximum write time, only a sample of write times.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)