[ 
https://issues.apache.org/jira/browse/KAFKA-17751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887957#comment-17887957
 ] 

Gaurav Narula commented on KAFKA-17751:
---------------------------------------

I think this is a regression introduced by KAFKA-16534, in the changes to 
[KafkaRaftClient::pollVoterAsFollower|https://github.com/apache/kafka/pull/16837/files#diff-1da15c51e641ea46ea5c86201ab8f21cfee9e7c575102a39c7bae0d5ffd7de39R3023].

In the scenario described above, the follower always hits the else block, and 
{{state.remainingUpdateVoterPeriodMs}} eventually returns {{0}}, so 
{{KafkaRaftClient::poll()}} ends up with a {{pollTimeoutMs}} of {{0}}. 
With a zero timeout, the call to {{messageQueue.poll}} does not block, which 
turns the poll loop into a busy-loop and causes the high CPU load.
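
To illustrate why a zero timeout is enough to peg a core, here is a minimal, standalone sketch. It does not use Kafka code: a plain {{java.util.concurrent.BlockingQueue}} stands in for the raft message queue, and the class and method names ({{PollTimeoutSketch}}, {{countIterations}}) are hypothetical. It only assumes that the real queue, like {{BlockingQueue.poll(timeout, unit)}}, returns immediately when the timeout is {{0}} and nothing is queued.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollTimeoutSketch {

    /**
     * Runs a poll loop against an always-empty queue for roughly budgetMs
     * milliseconds and returns how many times the loop body executed.
     * The empty queue is a stand-in (assumption) for an idle follower's inbox.
     */
    static long countIterations(long pollTimeoutMs, long budgetMs) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMs);
        long iterations = 0;
        while (System.nanoTime() < deadline) {
            try {
                // With a timeout of 0 this returns null immediately;
                // with a positive timeout the thread parks until it elapses.
                queue.poll(pollTimeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println("timeout=0ms:  " + countIterations(0, 200) + " iterations");
        System.out.println("timeout=50ms: " + countIterations(50, 200) + " iterations");
    }
}
```

With a positive timeout the loop runs only a handful of times in the 200 ms budget, because the thread sleeps inside {{poll}}. With a timeout of {{0}} it spins as fast as the CPU allows, which matches the symptom reported on the followers.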

> Controller high CPU when formatted with --initial-controllers
> -------------------------------------------------------------
>
>                 Key: KAFKA-17751
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17751
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.9.0
>            Reporter: Juha Mynttinen
>            Assignee: Gaurav Narula
>            Priority: Major
>              Labels: kraft
>         Attachments: Screenshot 2024-10-09 at 9.15.06.png, c1.properties, 
> c2.properties, c3.properties
>
>
> Hey,
> I'm using 3.9.0 RC0.
> The issue only affects KRaft.
> I noticed that formatting a simple three-node controller cluster with 
> --initial-controllers and starting the controllers leads to a situation where 
> the non-leader voters consume a lot of CPU.
> Here are the steps to reproduce. The needed configuration files are attached.
> Clean up and set up the environment.
> rm -rf /tmp/controllers && \
> mkdir -p /tmp/controllers/c1 && \
> mkdir -p /tmp/controllers/c2 && \
> mkdir -p /tmp/controllers/c3
> export KAFKA_HOME=<your_kafka_3_9_home>
> Format the controllers
> $KAFKA_HOME/bin/kafka-storage.sh format \
>   --cluster-id 00000000-0000-0000-0000-000000000001 \
>   --initial-controllers 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA \
>   --config c1.properties
> $KAFKA_HOME/bin/kafka-storage.sh format \
>   --cluster-id 00000000-0000-0000-0000-000000000001 \
>   --initial-controllers 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA \
>   --config c2.properties
> $KAFKA_HOME/bin/kafka-storage.sh format \
>   --cluster-id 00000000-0000-0000-0000-000000000001 \
>   --initial-controllers 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA \
>   --config c3.properties
> Start the controllers, in separate terminals
> $KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka c1.properties
> $KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka c2.properties
> $KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka c3.properties
> Observe that two of the controllers have CPU usage at 100%. If you check 
> which PID is which, you can see that it's the two non-leader voters that 
> have elevated CPU. The CPU usage of the leader is fine.
> I did some profiling in a slightly different environment. The screenshot is 
> attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
