[https://issues.apache.org/jira/browse/KAFKA-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17925888#comment-17925888]
R Dheeraj commented on KAFKA-16996:
-----------------------------------
Hi [~chia7712], I'm new to the Kafka open-source community.
I did a deep dive into the codebase for this issue:
According to `NetworkClient.leastLoadedNode()`, node selection follows this
priority order (sketched in the snippet below this list):

First priority:
- Nodes that are `READY`: the connection state is READY, the channel is ready,
and the node can accept more requests given its in-flight request count, per
`NetworkClient.canSendRequest()`. At this point the node is known to be alive.

Second priority:
- Nodes whose connection is currently being established. At this point the node
is also alive.

Least priority:
- Nodes for which `ClusterConnectionStates.canConnect()` returns true: the state
is DISCONNECTED and the time elapsed since the last connection attempt exceeds
the configured reconnect backoff. Such a node could be dead (as seen in
[~goalfull]'s case).
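As a rough illustration of that priority order (this is not the actual
kafka-clients code; names like `NodeState` and `pickLeastLoaded` are made up
for the sketch), the following shows why a node that may be dead can still be
returned when nothing is READY or CONNECTING:

```java
import java.util.List;

public class LeastLoadedSketch {

    enum ConnectionState { READY, CONNECTING, DISCONNECTED }

    // Minimal stand-in for a broker node as tracked on the client side.
    record NodeState(String id, ConnectionState state, int inFlightRequests,
                     long lastConnectAttemptMs) {}

    // Returns the preferred node following the three-tier priority:
    // 1) READY nodes that can accept more requests (fewest in-flight wins)
    // 2) nodes whose connection is currently being established
    // 3) DISCONNECTED nodes whose reconnect backoff has elapsed (may be dead)
    static NodeState pickLeastLoaded(List<NodeState> nodes, long nowMs,
                                     long reconnectBackoffMs) {
        NodeState bestReady = null;
        NodeState connecting = null;
        NodeState canConnect = null;

        for (NodeState node : nodes) {
            switch (node.state()) {
                case READY -> {
                    if (bestReady == null
                            || node.inFlightRequests() < bestReady.inFlightRequests())
                        bestReady = node;
                }
                case CONNECTING -> connecting = node;
                case DISCONNECTED -> {
                    if (nowMs - node.lastConnectAttemptMs() >= reconnectBackoffMs)
                        canConnect = node; // no liveness check here: could be dead
                }
            }
        }

        if (bestReady != null) return bestReady;   // priority 1
        if (connecting != null) return connecting; // priority 2
        return canConnect;                         // priority 3 (possibly faulty)
    }

    public static void main(String[] args) {
        List<NodeState> nodes = List.of(
                new NodeState("broker-1", ConnectionState.DISCONNECTED, 0, 0L), // dead
                new NodeState("broker-2", ConnectionState.DISCONNECTED, 0, 0L));
        // With no READY or CONNECTING nodes, a dead broker can be returned.
        System.out.println(pickLeastLoaded(nodes, System.currentTimeMillis(), 50L));
    }
}
```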
I suspect that in [~goalfull]'s case none of the nodes can accept new requests
(case 1) and there is no node whose connection is currently being established
(case 2). That leaves the third case, where `Fetcher.getTopicMetadata()#342`
throws an exception because the request failed and is not retriable. This would
make `fetcher.getAllTopicMetadata()` fail, which in turn causes the client
startup to fail.
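To show how this could surface at startup, here is a minimal sketch (broker
addresses, retry count, and backoff are my assumptions, not from this ticket)
that calls `KafkaConsumer.listTopics()`, which as far as I can tell goes
through `Fetcher.getAllTopicMetadata()` internally, and retries if the metadata
request fails:

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.PartitionInfo;

public class StartupMetadataRetry {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092,broker-2:9092"); // assumed addresses
        props.put("group.id", "example-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            Map<String, List<PartitionInfo>> topics = null;
            for (int attempt = 1; attempt <= 3 && topics == null; attempt++) {
                try {
                    // Fails here if the metadata request lands on a dead node
                    // and the error is not retried internally.
                    topics = consumer.listTopics(Duration.ofSeconds(10));
                } catch (KafkaException e) {
                    System.err.println("Metadata fetch failed (attempt " + attempt + "): " + e);
                    try {
                        Thread.sleep(1000L * attempt); // simple backoff before retrying
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
            System.out.println("Discovered topics: "
                    + (topics == null ? "none" : topics.keySet()));
        }
    }
}
```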
[~goalfull] Could you kindly confirm if my understanding is correct?
> The leastLoadedNode() function in kafka-client may choose a faulty node
> during the consumer thread starting and meanwhile one of the KAFKA server
> node is dead.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-16996
> URL: https://issues.apache.org/jira/browse/KAFKA-16996
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 2.0.1, 2.3.0, 3.6.0
> Reporter: Goufu
> Priority: Blocker
>
> The leastLoadedNode() function has a bug during the consumer process's
> startup period. The function sendMetadataRequest() called by
> getTopicMetadataRequest() uses a random node which may be faulty, since every
> node's state recorded in the client thread is not ready yet. It happened in
> my production environment while my consumer thread was restarting and, at the
> same time, one of the Kafka server nodes was dead. The client startup then failed.
> I'm using the kafka-client-2.0.1.jar. I have checked the source code of
> newer versions and the issue still exists.