[ 
https://issues.apache.org/jira/browse/KAFKA-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17925888#comment-17925888
 ] 

R Dheeraj commented on KAFKA-16996:
-----------------------------------

Hi [~chia7712], I'm new to Kafka open source. 

I did a deep-dive into the codebase for this issue: 

`NetworkClient.leastLoadedNode()` first prioritizes nodes that are `READY`, 
i.e. the state in `connectionStates` is `READY`, the channel is ready, and the 
node can accept more in-flight requests according to 
`NetworkClient.canSendRequest()`. At this point the node is alive.

Second priority: 
- A connection to the node is being established (at this point the node is alive)

Lowest priority: 
- `ClusterConnectionStates.canConnect()` is true: the state is disconnected and 
the configured reconnect backoff has elapsed since the last connection 
attempt. Such a node could be dead (as seen in [~goalfull]'s case).
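
To make the ordering above concrete, here is a minimal sketch of the three-tier 
selection. It is a simplified model, not the actual `NetworkClient` code: 
`LeastLoadedSketch`, `NodeView`, `pick()` and its parameters are hypothetical 
names invented for illustration, and the real method tracks more state.

```java
import java.util.List;

// Simplified, hypothetical model of the selection order; not Kafka's actual code.
final class LeastLoadedSketch {
    enum State { READY, CONNECTING, DISCONNECTED }

    // Hypothetical stand-in for what the client tracks per broker.
    static final class NodeView {
        final String id;
        final State state;
        final int inFlight;              // requests sent but not yet answered
        final long lastConnectAttemptMs; // time of the last connection attempt

        NodeView(String id, State state, int inFlight, long lastConnectAttemptMs) {
            this.id = id;
            this.state = state;
            this.inFlight = inFlight;
            this.lastConnectAttemptMs = lastConnectAttemptMs;
        }
    }

    static NodeView pick(List<NodeView> nodes, long nowMs,
                         int maxInFlight, long reconnectBackoffMs) {
        NodeView ready = null;
        NodeView connecting = null;
        NodeView canConnect = null;
        for (NodeView n : nodes) {
            if (n.state == State.READY && n.inFlight < maxInFlight) {
                // Case 1: usable right now; keep the one with the fewest in-flight requests.
                if (ready == null || n.inFlight < ready.inFlight) {
                    ready = n;
                }
            } else if (n.state == State.CONNECTING) {
                // Case 2: a connection is being established.
                connecting = n;
            } else if (n.state == State.DISCONNECTED
                    && nowMs - n.lastConnectAttemptMs >= reconnectBackoffMs) {
                // Case 3: backoff has elapsed; nothing here says the broker is actually up.
                if (canConnect == null) {
                    canConnect = n;
                }
            }
        }
        if (ready != null) return ready;
        if (connecting != null) return connecting;
        return canConnect; // null when no node qualifies
    }
}
```

In this model, once every `READY` node is saturated and nothing is connecting, 
`pick()` returns a disconnected node without knowing whether it is reachable.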

I suspect that in [~goalfull]'s case, none of the nodes could accept new 
requests (case 1) and no node had a connection being established (case 2). So 
it could be the third case, where `Fetcher.getTopicMetadata()#342` throws an 
exception because the request failed and is not retriable. This could be 
causing `fetcher.getAllTopicMetadata()` to fail, which in turn causes client 
startup to fail. 

[~goalfull] Could you kindly confirm if my understanding is correct?

> The leastLoadedNode() function in kafka-client may choose a faulty node 
> during the consumer thread starting and meanwhile one of the KAFKA server 
> node is dead.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-16996
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16996
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.0.1, 2.3.0, 3.6.0
>            Reporter: Goufu
>            Priority: Blocker
>
> The leastLoadedNode() function has a bug during the consumer process starting 
> period. The function sendMetadataRequest() called by 
> getTopicMetadataRequest() uses a random node which may be faulty since every 
> node's state recorded in the client thread is not ready yet. It happened in 
> my production environment while my consumer thread was restarting and 
> one of the Kafka server nodes was dead. Then the client startup failed. 
> I'm using the kafka-client-2.0.1.jar. I have checked the source code of 
> higher versions and the issue still exists.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
