[
https://issues.apache.org/jira/browse/KAFKA-17751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887957#comment-17887957
]
Gaurav Narula commented on KAFKA-17751:
---------------------------------------
I think this is a regression introduced by KAFKA-16534, in the changes to
[KafkaRaftClient::pollVoterAsFollower|https://github.com/apache/kafka/pull/16837/files#diff-1da15c51e641ea46ea5c86201ab8f21cfee9e7c575102a39c7bae0d5ffd7de39R3023].
In the scenario described above, the follower always hits the else block, and
{{state.remainingUpdateVoterPeriodMs}} eventually returns {{0}}, so
{{KafkaRaftClient::poll()}} ends up with a {{pollTimeoutMs}} of {{0}}.
The call to {{messageQueue.poll}} then returns immediately instead of blocking,
which turns the poll loop into a busy-loop and causes the high CPU load.
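
To illustrate the mechanism, here is a minimal standalone sketch. It is not the
KafkaRaftClient code: the class and method names are made up for illustration,
and a plain {{LinkedBlockingQueue}} stands in for {{messageQueue}}. It counts
how often an empty queue can be polled in a fixed time window when the poll
timeout is {{0}} versus a non-zero value.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it only mimics the shape of a poll loop.
public class ZeroTimeoutPollDemo {

    // Polls an empty queue in a loop for roughly windowMs milliseconds and
    // returns the number of poll calls that completed in that window.
    static long countPolls(long pollTimeoutMs, long windowMs) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(windowMs);
        long polls = 0;
        try {
            while (System.nanoTime() - deadline < 0) {
                // With a timeout of 0 this returns null right away;
                // with a positive timeout the thread parks until it expires.
                queue.poll(pollTimeoutMs, TimeUnit.MILLISECONDS);
                polls++;
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop counting.
            Thread.currentThread().interrupt();
        }
        return polls;
    }

    public static void main(String[] args) {
        System.out.println("timeout=0ms  polls=" + countPolls(0, 200));
        System.out.println("timeout=50ms polls=" + countPolls(50, 200));
    }
}
```

With a zero timeout the loop never parks the thread, so the poll count is
orders of magnitude higher and one core stays busy, which matches the symptom
seen on the followers.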
> Controller high CPU when formatted with --initial-controllers
> -------------------------------------------------------------
>
> Key: KAFKA-17751
> URL: https://issues.apache.org/jira/browse/KAFKA-17751
> Project: Kafka
> Issue Type: Bug
> Components: kraft
> Affects Versions: 3.9.0
> Reporter: Juha Mynttinen
> Assignee: Gaurav Narula
> Priority: Major
> Labels: kraft
> Attachments: Screenshot 2024-10-09 at 9.15.06.png, c1.properties,
> c2.properties, c3.properties
>
>
> Hey,
> I'm using 3.9.0 RC0.
> The issue only affects kraft.
> I noticed that formatting a simple three-node controller cluster with
> --initial-controllers and starting the controllers leads to a situation where
> the non-leader voters consume a lot of CPU.
> Here are the steps to reproduce. The needed configuration files are attached.
> Clean up and set up the environment.
> rm -rf /tmp/controllers && \
> mkdir -p /tmp/controllers/c1 && \
> mkdir -p /tmp/controllers/c2 && \
> mkdir -p /tmp/controllers/c3
> export KAFKA_HOME=<your_kafka_3_9_home>
> Format the controllers
> $KAFKA_HOME/bin/kafka-storage.sh format --cluster-id \
> 00000000-0000-0000-0000-000000000001 --initial-controllers \
> 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA \
> --config c1.properties
> $KAFKA_HOME/bin/kafka-storage.sh format --cluster-id \
> 00000000-0000-0000-0000-000000000001 --initial-controllers \
> 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA \
> --config c2.properties
> $KAFKA_HOME/bin/kafka-storage.sh format --cluster-id \
> 00000000-0000-0000-0000-000000000001 --initial-controllers \
> 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA \
> --config c3.properties
> Start the controllers, in separate terminals
> $KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka \
> c1.properties
> $KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka \
> c2.properties
> $KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka \
> c3.properties
> Observe that two of the controllers have CPU usage at 100%. If you check
> which PID is which, you can see that it's the two non-leader voters that have
> elevated CPU. The CPU usage of the leader is fine.
> I did some profiling in a slightly different environment. The screenshot is
> attached.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)