[
https://issues.apache.org/jira/browse/KAFKA-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chia-Ping Tsai resolved KAFKA-17061.
------------------------------------
Fix Version/s: 3.9.0
Resolution: Fixed
> KafkaController takes long time to connect to newly added broker after
> registration on large cluster
> ----------------------------------------------------------------------------------------------------
>
> Key: KAFKA-17061
> URL: https://issues.apache.org/jira/browse/KAFKA-17061
> Project: Kafka
> Issue Type: Improvement
> Reporter: Haruki Okada
> Assignee: Haruki Okada
> Priority: Major
> Fix For: 3.9.0
>
> Attachments: flame-patched.html, flame.html,
> image-2024-07-02-17-22-06-100.png, image-2024-07-02-17-24-11-861.png,
> screenshot-flame-patched.png, screenshot-flame.png
>
>
> h2. Environment
> * Kafka version: 3.3.2
> * Cluster: ~200 brokers
> * Total number of partitions: 40k
> * ZK-based cluster
> h2. Phenomenon
> When a broker left the cluster due to a long STW (stop-the-world) pause and
> came back after a while, the controller took about 6 seconds to connect to
> the broker after its znode registration, which caused a significant message
> delivery delay.
> {code:java}
> [2024-06-22 23:59:38,202] INFO [Controller id=1] Newly added brokers: 2,
> deleted brokers: , bounced brokers: , all live brokers: 1,...
> (kafka.controller.KafkaController)
> [2024-06-22 23:59:38,203] DEBUG [Channel manager on controller 1]: Controller
> 1 trying to connect to broker 2 (kafka.controller.ControllerChannelManager)
> [2024-06-22 23:59:38,205] INFO [RequestSendThread controllerId=1] Starting
> (kafka.controller.RequestSendThread)
> [2024-06-22 23:59:38,205] INFO [Controller id=1] New broker startup callback
> for 2 (kafka.controller.KafkaController)
> [2024-06-22 23:59:44,524] INFO [RequestSendThread controllerId=1] Controller
> 1 connected to broker-2:9092 (id: 2 rack: rack-2) for sending state change
> requests (kafka.controller.RequestSendThread)
> {code}
> h2. Analysis
> The flamegraph captured at that time shows that
> [liveBrokerIds|https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/controller/ControllerContext.scala#L217],
> called by `isReplicaOnline`, accounts for a significant share of the time
> spent in the `addUpdateMetadataRequestForBrokers` invocation on broker startup.
> !image-2024-07-02-17-24-11-861.png|width=541,height=303!
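
The cost pattern described in the analysis can be illustrated with a minimal Java sketch. It is a simplified model, not Kafka's actual controller code: the class `LiveBrokerLookup`, its fields, and both `isReplicaOnline*` methods are hypothetical names, and it assumes that the live-broker set is derived afresh on every call and that each partition has 3 replicas. Under those assumptions, one membership check per replica builds a broker-sized set each time, while checking the two underlying sets directly gives the same answer without building anything.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the lookup pattern; not the real ControllerContext.
class LiveBrokerLookup {
    private final Set<Integer> registeredBrokerIds;
    private final Set<Integer> shuttingDownBrokerIds;
    // Counts how many derived sets were built, to make the cost visible.
    int setsBuilt = 0;

    LiveBrokerLookup(Set<Integer> registered, Set<Integer> shuttingDown) {
        this.registeredBrokerIds = new HashSet<>(registered);
        this.shuttingDownBrokerIds = new HashSet<>(shuttingDown);
    }

    // Assumption: the live set is recomputed on each call, costing O(brokers).
    Set<Integer> liveBrokerIds() {
        setsBuilt++;
        Set<Integer> live = new HashSet<>(registeredBrokerIds);
        live.removeAll(shuttingDownBrokerIds);
        return live;
    }

    // One full set construction per membership check.
    boolean isReplicaOnlineSlow(int brokerId) {
        return liveBrokerIds().contains(brokerId);
    }

    // Same answer from two O(1) lookups, with no intermediate set.
    boolean isReplicaOnlineFast(int brokerId) {
        return registeredBrokerIds.contains(brokerId)
            && !shuttingDownBrokerIds.contains(brokerId);
    }

    public static void main(String[] args) {
        int brokers = 200;        // cluster size from the report
        int partitions = 40_000;  // partition count from the report
        int replicas = 3;         // assumed replication factor

        Set<Integer> registered = new HashSet<>();
        for (int id = 0; id < brokers; id++) {
            registered.add(id);
        }
        LiveBrokerLookup lookup = new LiveBrokerLookup(registered, Set.of(5));

        for (int p = 0; p < partitions; p++) {
            for (int r = 0; r < replicas; r++) {
                int brokerId = (p + r) % brokers;
                boolean slow = lookup.isReplicaOnlineSlow(brokerId);
                boolean fast = lookup.isReplicaOnlineFast(brokerId);
                if (slow != fast) {
                    throw new IllegalStateException("lookups disagree for broker " + brokerId);
                }
            }
        }
        System.out.println("derived sets built: " + lookup.setsBuilt);
    }
}
```

With these numbers the slow path builds 120,000 sets of roughly 200 elements for a single pass over all replicas, which is the kind of per-call overhead that would show up in a flamegraph the way the analysis describes.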
--
This message was sent by Atlassian Jira
(v8.20.10#820010)