[ https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oleksandr Shulgin updated KAFKA-16054:
--------------------------------------
Description:
We have now observed, for the third time in production, an issue where a Kafka
broker suddenly jumps to 100% CPU usage and does not recover on its own until
it is manually restarted.
After a deeper investigation, we now believe that this is an instance of the
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
(same workaround in the current Netty code)
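For reference, the workaround in the Netty links above boils down to counting
select() calls that return prematurely with nothing to do and, past a threshold,
replacing the presumably broken Selector with a fresh one. Below is a minimal,
hypothetical Java sketch of that idea only; it is not Netty or Kafka code, the
class and method names are made up, and only the 512 threshold is taken from the
linked NioEventLoop source:

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical sketch of the Netty-style workaround: if select() keeps waking up
// early with no ready keys, assume the epoll selector is broken and rebuild it.
class SelectorRebuildSketch {
    private static final int REBUILD_THRESHOLD = 512; // Netty's default auto-rebuild threshold
    private Selector selector;

    SelectorRebuildSketch() throws IOException {
        selector = Selector.open();
    }

    void runLoop() throws IOException {
        int prematureWakeups = 0;
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            int ready = selector.select(300); // same 300 ms timeout seen in the strace below
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            if (ready == 0 && elapsedMs < 300) {
                // Woke up early with nothing selected: possibly the epoll bug.
                if (++prematureWakeups >= REBUILD_THRESHOLD) {
                    selector = rebuild(selector);
                    prematureWakeups = 0;
                }
            } else {
                prematureWakeups = 0;
                // ... handle selector.selectedKeys() here ...
                selector.selectedKeys().clear();
            }
        }
    }

    // Re-register every valid key on a fresh Selector and close the old one.
    private static Selector rebuild(Selector old) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : old.keys()) {
            if (key.isValid()) {
                key.channel().register(fresh, key.interestOps(), key.attachment());
                key.cancel();
            }
        }
        old.close();
        return fresh;
    }
}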
The first occurrence in our production environment was on 2023-08-26; the other
two were on 2023-12-10 and 2023-12-20.
Each time, the high CPU also results in another issue (misplaced messages) that I
asked about on the users mailing list in September but, unfortunately, have not
received a single reply to so far:
[https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]
We still do not know what causes this other issue.
When the high CPU happens, top(1) reports a number of "data-plane-kafka..."
threads consuming ~60% user and ~40% system CPU, and the thread dump contains a
lot of stack traces like the following one:
"data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10"
#76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0x0000ffffa12d7690
nid=0x20c runnable [0x0000fffed87fe000]
java.lang.Thread.State: RUNNABLE
#011at sun.nio.ch.EPoll.wait([email protected]/Native Method)
#011at
sun.nio.ch.EPollSelectorImpl.doSelect([email protected]/EPollSelectorImpl.java:118)
#011at
sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:129)
#011- locked <0x00000006c1246410> (a sun.nio.ch.Util$2)
#011- locked <0x00000006c1246318> (a sun.nio.ch.EPollSelectorImpl)
#011at sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:141)
#011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
#011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
#011at kafka.network.Processor.poll(SocketServer.scala:1107)
#011at kafka.network.Processor.run(SocketServer.scala:1011)
#011at java.lang.Thread.run([email protected]/Thread.java:840)
At the same time, the Linux kernel repeatedly reports "TCP: out of memory --
consider tuning tcp_mem".
We are running relatively big machines in production (c6g.4xlarge with 30 GB
RAM), and the auto-configured setting is "net.ipv4.tcp_mem = 376608 502145
753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB
memory pages.
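For reference (simple arithmetic on the values above, assuming 4 KB pages as
stated): 753216 pages * 4096 bytes/page = 3,085,172,736 bytes, i.e. about 2.9 GiB
for the "high" threshold; the "low" and "pressure" values work out to roughly
1.5 GB and 2 GB, respectively.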
We were able to reproduce the issue in our test environment (which uses 4x
smaller machines) simply by tuning tcp_mem down by a factor of 10: "sudo
sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". An strace of one of the busy
Kafka threads shows the following syscalls repeating constantly:
epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=4681111381628432382}}], 1024, 300, NULL, 8) = 1
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)
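Our reading of this loop (an interpretation, and the Java below is an
illustrative sketch rather than actual Kafka code; the class, method and
parameter names are made up): the broker serves fetch responses with zero-copy
via FileChannel.transferTo(), which maps to sendfile(2) on Linux. Under tcp_mem
pressure the kernel refuses to queue more data, so sendfile() keeps returning
EAGAIN while epoll still reports the socket as writable (EPOLLOUT), and the
network thread spins without making progress:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Illustrative sketch only of the busy loop seen in the strace: the selector keeps
// reporting EPOLLOUT, but transferTo()/sendfile(2) cannot make progress, so the
// same syscalls repeat and the thread burns 100% CPU.
class SendfileSpinSketch {
    // Assumes 'socket' (fd 12286 in the strace) is non-blocking and registered with
    // 'selector' for OP_WRITE, and 'logSegment' is the log file (fd 12019).
    static void serve(Selector selector, SocketChannel socket,
                      FileChannel logSegment, long position, long remaining) throws IOException {
        while (remaining > 0) {
            selector.select(300);                        // epoll_pwait(..., 300, ...) = 1
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isWritable()) {                  // EPOLLOUT: socket looks writable
                    // transferTo() uses sendfile(2) on Linux; with the kernel out of
                    // TCP memory, sendfile() returns EAGAIN and transferTo() returns 0.
                    long sent = logSegment.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;                   // 0 bytes sent => no progress, spin again
                }
            }
            selector.selectedKeys().clear();
        }
    }
}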
Resetting the "tcp_mem" parameters back to the auto-configured values in the
test environment relieves the pressure on the broker, and it then continues
normally without a restart.
We have found a bug report that suggests the issue may be partially due to a
kernel bug:
[https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335]
(they are using kernel version 5.15)
We have updated our kernel from 6.1.29 to 6.1.66, which made it harder to
reproduce the issue, but we can still do so by reducing all of the "tcp_mem"
parameters by a factor of 1,000. The JVM behavior is the same under these
conditions.
A similar issue is reported here, affecting Kafka Connect:
https://issues.apache.org/jira/browse/KAFKA-4739
Our production Kafka is running version 3.3.2 and our test environment is
running 3.6.1. The issue is present on both systems.
> Sudden 100% CPU on a broker
> ---------------------------
>
> Key: KAFKA-16054
> URL: https://issues.apache.org/jira/browse/KAFKA-16054
> Project: Kafka
> Issue Type: Bug
> Components: network
> Affects Versions: 3.3.2, 3.6.1
> Environment: Amazon AWS, c6g.4xlarge arm64 16 vCPUs + 30 GB, Amazon
> Linux
> Reporter: Oleksandr Shulgin
> Priority: Critical
> Labels: linux
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)