Oleksandr Shulgin created KAFKA-16054:
-----------------------------------------

             Summary: Sudden 100% CPU on a broker
                 Key: KAFKA-16054
                 URL: https://issues.apache.org/jira/browse/KAFKA-16054
             Project: Kafka
          Issue Type: Bug
          Components: network
    Affects Versions: 3.6.1, 3.3.2
         Environment: Amazon AWS, c6g.4xlarge arm64, 16 vCPUs + 30 GB, Amazon Linux
            Reporter: Oleksandr Shulgin


We have now observed, for the third time in production, an issue where a Kafka 
broker suddenly jumps to 100% CPU usage and does not recover on its own until 
it is manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty code)
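
For reference, the workaround those Netty links implement boils down to 
detecting select() calls that return early with nothing ready and, after 
enough of them in a row, rebuilding the selector. Below is a rough, 
hypothetical Java sketch of that pattern (not Kafka's code and not Netty's 
verbatim; the class name and threshold are made up for illustration):

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Rough sketch of the workaround pattern from the Netty links above
// (hypothetical code, not Kafka's and not Netty's verbatim): if select()
// keeps returning zero ready keys long before its timeout has elapsed,
// assume the epoll spin bug and rebuild the selector by re-registering
// every channel on a fresh one. The threshold value is illustrative.
public final class SpuriousWakeupGuard {
    private static final int REBUILD_THRESHOLD = 512;

    private Selector selector;
    private int prematureReturns;

    SpuriousWakeupGuard(Selector selector) {
        this.selector = selector;
    }

    Selector selectOrRebuild(long timeoutMs) throws IOException {
        long start = System.nanoTime();
        int ready = selector.select(timeoutMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;

        if (ready == 0 && elapsedMs < timeoutMs) {
            // Returned early with nothing selected: possibly a spurious epoll wakeup.
            if (++prematureReturns >= REBUILD_THRESHOLD) {
                selector = rebuild(selector);
                prematureReturns = 0;
            }
        } else {
            prematureReturns = 0;
        }
        return selector;
    }

    private static Selector rebuild(Selector old) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : old.keys()) {
            if (key.isValid()) {
                // Carry each channel over with its interest set and attachment.
                key.channel().register(fresh, key.interestOps(), key.attachment());
                key.cancel();
            }
        }
        old.close();
        return fresh;
    }
}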

The first occurrence in our production environment was on 2023-08-26; the 
other two were on 2023-12-10 and 2023-12-20.

Each time, the high-CPU issue also results in another issue (misplaced 
messages) that I asked about on the users mailing list in September but, 
unfortunately, have not received a single reply to so far: 
[https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]

We still do not know how this other issue happens.

When the high CPU happens, "top -H" reports a number of 
"data-plane-kafka-..." threads consuming ~60% user and ~40% system CPU, and 
the thread dump contains a lot of stack traces like the following one:

"data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" 
#76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0x0000ffffa12d7690 
nid=0x20c runnable [0x0000fffed87fe000]
java.lang.Thread.State: RUNNABLE
#011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method)
#011at 
sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118)
#011at 
sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129)
#011- locked <0x00000006c1246410> (a sun.nio.ch.Util$2)
#011- locked <0x00000006c1246318> (a sun.nio.ch.EPollSelectorImpl)
#011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141)
#011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
#011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
#011at kafka.network.Processor.poll(SocketServer.scala:1107)
#011at kafka.network.Processor.run(SocketServer.scala:1011)
#011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
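
(As an aside, the same picture can be captured from inside the JVM: the 
standard ThreadMXBean API exposes per-thread CPU time, so a small hypothetical 
helper like the one below, if run inside the broker process, would show these 
network threads at the top, much like "top -H" does. This is only an 
illustrative sketch, not something Kafka ships.)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative only (not part of Kafka): prints per-thread CPU time for the
// current JVM, so runaway "data-plane-kafka-network-thread-..." threads stand
// out the same way they do in "top -H" and the thread dump above.
public final class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported()) {
            System.err.println("Per-thread CPU time is not supported on this JVM");
            return;
        }
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            long cpuNanos = threads.getThreadCpuTime(id);
            if (info != null && cpuNanos > 0) {
                System.out.printf("%-70s %12.1f ms%n",
                        info.getThreadName(), cpuNanos / 1_000_000.0);
            }
        }
    }
}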

At the same time, the Linux kernel repeatedly reports "TCP: out of memory -- 
consider tuning tcp_mem".

We are running relatively big machines in production (c6g.4xlarge with 30 GB 
RAM), and the auto-configured setting is "net.ipv4.tcp_mem = 376608 502145 
753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB 
memory pages.
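
(To spell out the arithmetic behind that estimate: tcp_mem values are counted 
in pages, so the "high" threshold is 753216 pages * 4096 bytes/page = 
3,085,172,736 bytes, i.e. about 2.9 GiB, or roughly 3 GB.)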

We were able to reproduce the issue in our test environment (which uses 4x 
smaller machines) simply by tuning tcp_mem down by a factor of 10: "sudo 
sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". An strace of one of the busy 
Kafka threads shows the following syscalls repeating constantly:

epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=4681111381628432382}}], 1024, 300, NULL, 8) = 1
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)
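
In other words, epoll keeps reporting the socket as writable while sendfile() 
immediately fails with EAGAIN, so the network thread never blocks. A rough, 
hypothetical Java sketch of that shape (not Kafka's actual code; the class 
name is made up and the offsets are taken from the trace above):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Rough sketch, not Kafka code: a level-triggered selector loop that spins when
// the kernel keeps reporting EPOLLOUT but refuses to actually queue more data
// (sendfile/transferTo makes no progress), so select() returns immediately on
// every iteration and the thread burns CPU instead of blocking.
public final class WritableButStuck {
    static void pump(Selector selector, FileChannel logSegment) throws IOException {
        while (true) {
            selector.select(300);                          // epoll_pwait(..., 300, ...)
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isValid() && key.isWritable()) {
                    SocketChannel socket = (SocketChannel) key.channel();
                    // transferTo() is sendfile(2) under the hood; under tcp_mem
                    // pressure it returns 0 (EAGAIN in the strace above).
                    long sent = logSegment.transferTo(174_899_834L, 947_517L, socket);
                    // If nothing was sent and OP_WRITE interest is not cleared,
                    // the very next select() reports this key again -> busy loop.
                }
            }
            selector.selectedKeys().clear();
        }
    }
}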

Resetting the "tcp_mem" parameters back to the auto-configured values in the 
test environment removes the pressure from the broker and it can continue 
normally without restart.

We have found a bug report that suggests the issue may be partially due to a 
kernel bug: 
[https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] 
(they are using kernel version 5.15)

We have updated our kernel from 6.1.29 to 6.1.66, which made it harder to 
reproduce the issue, but we can still do it by reducing all of the "tcp_mem" 
parameters by a factor of 1,000. The JVM behavior is the same under these 
conditions.

A similar issue is reported here, affecting Kafka Connect:
https://issues.apache.org/jira/browse/KAFKA-4739

Our production Kafka is running version 3.3.2, and our test environment is 
running 3.6.1. The issue is present on both systems.



