[ https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804353#comment-17804353 ]

Oleksandr Shulgin commented on KAFKA-16054:
-------------------------------------------

I was trying to implement the same workaround in Kafka's Selector class,
but instead I found something that may explain the issue we are seeing.
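
For context, the workaround referenced in the issue below (the Netty one) is,
as I understand it, to detect select() calls that return zero ready keys
without having waited the requested timeout and, past a threshold of such
premature wakeups, to rebuild the selector by re-registering every channel on
a freshly opened one. A rough sketch of that idea in plain NIO terms (my
paraphrase, not Netty's or Kafka's actual code; the threshold value mirrors
Netty's default):

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;

    public class EpollSpinWorkaround {
        private static final int REBUILD_THRESHOLD = 512; // mirrors Netty's default
        private Selector selector;
        private int prematureWakeups;

        EpollSpinWorkaround() throws IOException {
            selector = Selector.open();
        }

        int select(long timeoutMs) throws IOException {
            long start = System.nanoTime();
            int ready = selector.select(timeoutMs);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            if (ready == 0 && elapsedMs < timeoutMs)
                prematureWakeups++;      // woke up early with nothing to do
            else
                prematureWakeups = 0;

            if (prematureWakeups >= REBUILD_THRESHOLD) {
                rebuildSelector();       // assume the epoll spin bug and start over
                prematureWakeups = 0;
            }
            return ready;
        }

        private void rebuildSelector() throws IOException {
            Selector newSelector = Selector.open();
            for (SelectionKey key : selector.keys()) {
                if (key.isValid())
                    key.channel().register(newSelector, key.interestOps(), key.attachment());
            }
            selector.close();
            selector = newSelector;
        }
    }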

It seems that the issue is caused by the combination of:
1. Setting timeout to 0 in the poll() method, on the condition
"madeReadProgressLastCall && dataInBuffers":
https://github.com/apache/kafka/blob/3.6/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L449-L450
2. Setting madeReadProgressLastCall to "true" even when nothing was actually
read from a channel:
https://github.com/apache/kafka/blob/3.6/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L688
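
To illustrate how these two interact, here is a stripped-down, runnable model
of that control flow (my simplification, not the actual Selector code; the
names are borrowed from Selector.java). Once a channel keeps buffered data
that cannot be drained, the flag comes back as "true" on every pass even
though read() returned 0 bytes, so each subsequent poll() forces the timeout
to 0 and the network thread busy-loops on select():

    public class SelectorSpinModel {

        private boolean madeReadProgressLastPoll = true;
        private boolean dataInBuffers;   // stands in for !keysWithBufferedRead.isEmpty()

        // stand-in for Selector.poll(long timeout)
        void poll(long timeout) {
            boolean madeReadProgressLastCall = madeReadProgressLastPoll;
            madeReadProgressLastPoll = false;    // reset for this pass (simplified)

            if (madeReadProgressLastCall && dataInBuffers)
                timeout = 0;                     // point 1: select() returns immediately

            select(timeout);
            attemptRead(0 /* bytesReceived */, false /* channelMuted */);
        }

        // stand-in for Selector.attemptRead(...)
        void attemptRead(long bytesReceived, boolean channelMuted) {
            if (bytesReceived != 0)
                madeReadProgressLastPoll = true; // actual read progress
            if (!channelMuted)
                madeReadProgressLastPoll = true; // point 2: set even with 0 bytes read
        }

        void select(long timeout) {
            System.out.println("select() with timeout=" + timeout + " ms");
        }

        public static void main(String[] args) {
            SelectorSpinModel m = new SelectorSpinModel();
            m.poll(300);                 // no buffered data yet: may block up to 300 ms
            m.dataInBuffers = true;      // a channel now holds data that cannot be drained
            for (int i = 0; i < 4; i++)
                m.poll(300);             // every call degenerates to timeout=0: busy loop
        }
    }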

The madeReadProgressLastCall flag is apparently there to prevent the Kafka
user space from running out of memory due to all the buffered data, but in
this case it causes the kernel to run out of TCP buffer memory...

Can someone more experienced with the code validate my assumptions, please?

> Sudden 100% CPU on a broker
> ---------------------------
>
>                 Key: KAFKA-16054
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16054
>             Project: Kafka
>          Issue Type: Bug
>          Components: network
>    Affects Versions: 3.3.2, 3.6.1
>         Environment: Amazon AWS, c6g.4xlarge arm64 16 vCPUs + 30 GB,  Amazon 
> Linux
>            Reporter: Oleksandr Shulgin
>            Priority: Critical
>              Labels: linux
>
> We have observed now for the 3rd time in production the issue where a Kafka 
> broker will suddenly jump to 100% CPU usage and will not recover on its own 
> until manually restarted.
> After a deeper investigation, we now believe that this is an instance of the 
> infamous epoll bug. See:
> [https://github.com/netty/netty/issues/327]
> [https://github.com/netty/netty/pull/565] (original workaround)
> [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
>  (same workaround in the current Netty code)
> The first occurrence in our production environment was on 2023-08-26 and the 
> other two — on 2023-12-10 and 2023-12-20.
> Each time, the high-CPU issue also results in this other issue (misplaced
> messages) that I asked about on the users mailing list in September, but to
> date I have not received a single reply, unfortunately:
> [https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]
> We still do not know how this other issue is happening.
> When the high CPU happens, top(1) reports a number of "data-plane-kafka..." 
> threads consuming ~60% user and ~40% system CPU, and the thread dump contains 
> a lot of stack traces like the following one:
> "data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10"
>  #76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s 
> tid=0x0000ffffa12d7690 nid=0x20c runnable [0x0000fffed87fe000]
> java.lang.Thread.State: RUNNABLE
> #011at sun.nio.ch.EPoll.wait([email protected]/Native Method)
> #011at 
> sun.nio.ch.EPollSelectorImpl.doSelect([email protected]/EPollSelectorImpl.java:118)
> #011at 
> sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:129)
> #011- locked <0x00000006c1246410> (a sun.nio.ch.Util$2)
> #011- locked <0x00000006c1246318> (a sun.nio.ch.EPollSelectorImpl)
> #011at sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:141)
> #011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
> #011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
> #011at kafka.network.Processor.poll(SocketServer.scala:1107)
> #011at kafka.network.Processor.run(SocketServer.scala:1011)
> #011at java.lang.Thread.run([email protected]/Thread.java:840)
> At the same time the Linux kernel reports repeatedly "TCP: out of memory – 
> consider tuning tcp_mem".
> We are running relatively big machines in production — c6g.4xlarge with 30 GB 
> RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 
> 753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB 
> memory pages.
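> (For reference: tcp_mem is measured in pages, so the "high" threshold is
> 753216 pages × 4096 bytes/page ≈ 3.1 GB.)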
> We were able to reproduce the issue in our test environment (which is using 
> 4x smaller machines), simply by tuning the tcp_mem down by a factor of 10: 
> "sudo sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". The strace of one of 
> the busy Kafka threads shows the following syscalls repeating constantly:
> epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=4681111381628432382}}],
>  1024, 300, NULL, 8) = 1
> fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
> fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
> sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily
> unavailable)
> Resetting the "tcp_mem" parameters back to the auto-configured values in the 
> test environment removes the pressure from the broker and it can continue 
> normally without restart.
> We have found a bug report that suggests the issue may be partially due to a
> kernel bug:
> [https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335]
> (they are using kernel version 5.15)
> We have updated our kernel from 6.1.29 to 6.1.66 and that made it harder to
> reproduce the issue, but we can still do it by reducing all of the "tcp_mem"
> parameters by a factor of 1,000. The JVM behavior is the same under these
> conditions.
> A similar issue is reported here, affecting Kafka Connect:
> https://issues.apache.org/jira/browse/KAFKA-4739
> Our production Kafka is running version 3.3.2, and our test environment is
> running 3.6.1. The issue is present on both systems.
> The issue is also reproducible on JDK 11 (as you can see from the stack 
> trace, we are using 17).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
