[ https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oleksandr Shulgin updated KAFKA-16054:
--------------------------------------
Description:
We have now observed, for the third time in production, an issue where a Kafka
broker suddenly jumps to 100% CPU usage and does not recover on its own until
it is manually restarted.
After a deeper investigation, we now believe that this is an instance of the
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
(same workaround in the current Netty code)
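For reference, the workaround in the Netty links above boils down to counting
select() calls that return prematurely with nothing to do and, past a threshold,
replacing the presumably broken Selector with a fresh one. Below is a minimal,
hypothetical Java sketch of that idea only; it is not Netty or Kafka code, the
class and method names are made up, and only the 512 threshold is taken from the
linked NioEventLoop source:

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical sketch of the Netty-style workaround: if select() keeps waking up
// early with no ready keys, assume the epoll selector is broken and rebuild it.
class SelectorRebuildSketch {
    private static final int REBUILD_THRESHOLD = 512; // Netty's default auto-rebuild threshold
    private Selector selector;

    SelectorRebuildSketch() throws IOException {
        selector = Selector.open();
    }

    void runLoop() throws IOException {
        int prematureWakeups = 0;
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            int ready = selector.select(300); // same 300 ms timeout seen in the strace below
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            if (ready == 0 && elapsedMs < 300) {
                // Woke up early with nothing selected: possibly the epoll bug.
                if (++prematureWakeups >= REBUILD_THRESHOLD) {
                    selector = rebuild(selector);
                    prematureWakeups = 0;
                }
            } else {
                prematureWakeups = 0;
                // ... handle selector.selectedKeys() here ...
                selector.selectedKeys().clear();
            }
        }
    }

    // Re-register every valid key on a fresh Selector and close the old one.
    private static Selector rebuild(Selector old) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : old.keys()) {
            if (key.isValid()) {
                key.channel().register(fresh, key.interestOps(), key.attachment());
                key.cancel();
            }
        }
        old.close();
        return fresh;
    }
}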
The first occurrence in our production environment was on 2023-08-26; the other
two were on 2023-12-10 and 2023-12-20.
Each time, the high CPU also results in another issue (misplaced messages) that I
asked about on the users mailing list in September but, unfortunately, have not
received a single reply to so far:
[https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]
We still do not know what causes this other issue.
When the high CPU happens, top(1) reports a number of "data-plane-kafka..."
threads consuming ~60% user and ~40% system CPU, and the thread dump contains a
lot of stack traces like the following one:
"data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10"
#76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0x0000ffffa12d7690
nid=0x20c runnable [0x0000fffed87fe000]
java.lang.Thread.State: RUNNABLE
#011at sun.nio.ch.EPoll.wait([email protected]/Native Method)
#011at
sun.nio.ch.EPollSelectorImpl.doSelect([email protected]/EPollSelectorImpl.java:118)
#011at
sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:129)
#011- locked <0x00000006c1246410> (a sun.nio.ch.Util$2)
#011- locked <0x00000006c1246318> (a sun.nio.ch.EPollSelectorImpl)
#011at sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:141)
#011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
#011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
#011at kafka.network.Processor.poll(SocketServer.scala:1107)
#011at kafka.network.Processor.run(SocketServer.scala:1011)
#011at java.lang.Thread.run([email protected]/Thread.java:840)
At the same time, the Linux kernel repeatedly reports "TCP: out of memory --
consider tuning tcp_mem".
We are running relatively big machines in production (c6g.4xlarge with 30 GB
RAM), and the auto-configured setting is "net.ipv4.tcp_mem = 376608 502145
753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB
memory pages.
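For reference (simple arithmetic on the values above, assuming 4 KB pages as
stated): 753216 pages * 4096 bytes/page = 3,085,172,736 bytes, i.e. about 2.9 GiB
for the "high" threshold; the "low" and "pressure" values work out to roughly
1.5 GB and 2 GB, respectively.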
We were able to reproduce the issue in our test environment (which uses 4x
smaller machines) simply by tuning tcp_mem down by a factor of 10: "sudo
sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". An strace of one of the busy
Kafka threads shows the following syscalls repeating constantly:
epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=4681111381628432382}}], 1024, 300, NULL, 8) = 1
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)
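Our reading of this loop (an interpretation, and the Java below is an
illustrative sketch rather than actual Kafka code; the class, method and
parameter names are made up): the broker serves fetch responses with zero-copy
via FileChannel.transferTo(), which maps to sendfile(2) on Linux. Under tcp_mem
pressure the kernel refuses to queue more data, so sendfile() keeps returning
EAGAIN while epoll still reports the socket as writable (EPOLLOUT), and the
network thread spins without making progress:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Illustrative sketch only of the busy loop seen in the strace: the selector keeps
// reporting EPOLLOUT, but transferTo()/sendfile(2) cannot make progress, so the
// same syscalls repeat and the thread burns 100% CPU.
class SendfileSpinSketch {
    // Assumes 'socket' (fd 12286 in the strace) is non-blocking and registered with
    // 'selector' for OP_WRITE, and 'logSegment' is the log file (fd 12019).
    static void serve(Selector selector, SocketChannel socket,
                      FileChannel logSegment, long position, long remaining) throws IOException {
        while (remaining > 0) {
            selector.select(300);                        // epoll_pwait(..., 300, ...) = 1
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isWritable()) {                  // EPOLLOUT: socket looks writable
                    // transferTo() uses sendfile(2) on Linux; with the kernel out of
                    // TCP memory, sendfile() returns EAGAIN and transferTo() returns 0.
                    long sent = logSegment.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;                   // 0 bytes sent => no progress, spin again
                }
            }
            selector.selectedKeys().clear();
        }
    }
}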
Resetting the "tcp_mem" parameters back to the auto-configured values in the
test environment relieves the pressure on the broker, and it then continues
normally without a restart.
We have found a bug report that suggests the issue may be partially due to a
kernel bug:
[https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335]
(they are using kernel version 5.15)
We have updated our kernel from 6.1.29 to 6.1.66, which made it harder to
reproduce the issue, but we can still do so by reducing all of the "tcp_mem"
parameters by a factor of 1,000. The JVM behavior is the same under these
conditions.
A similar issue is reported here, affecting Kafka Connect:
https://issues.apache.org/jira/browse/KAFKA-4739
Our production Kafka is running version 3.3.2 and our test environment is
running 3.6.1. The issue is present on both systems.
> Sudden 100% CPU on a broker
> ---------------------------
>
> Key: KAFKA-16054
> URL: https://issues.apache.org/jira/browse/KAFKA-16054
> Project: Kafka
> Issue Type: Bug
> Components: network
> Affects Versions: 3.3.2, 3.6.1
> Environment: Amazon AWS, c6g.4xlarge arm64 16 vCPUs + 30 GB, Amazon
> Linux
> Reporter: Oleksandr Shulgin
> Priority: Critical
> Labels: linux
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)