[ 
https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921709#comment-16921709
 ] 

ASF GitHub Bot commented on KAFKA-7931:
---------------------------------------

aravindvs commented on pull request #7288: KAFKA-7931 : [Proposal] Fix metadata 
fetch for ephemeral brokers behind a Virtual IP
URL: https://github.com/apache/kafka/pull/7288
 
 
   If ephemeral brokers sit behind a Virtual IP and all of them go down, the
client cannot reconnect, as described in
https://issues.apache.org/jira/browse/KAFKA-7931. This is because we use the
bootstrap nodes only until the first metadata response arrives and then discard
them (we create a new metadata cache and a new cluster from that response). If
all the brokers then go down before the metadata is updated, the client stays
stuck until it is restarted.
   
   This patch stores the bootstrap broker list. Instead of giving up when no
`leastLoadedNode` is found, we use one of the bootstrap nodes to fetch the
metadata. We also make sure to use a bootstrap node only when it is not already
one of the nodes in the cluster.
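
   The fallback described above can be sketched roughly as follows. This is a
simplified illustration, not the code in the patch: `BootstrapFallback`,
`metadataTarget`, and the use of plain address strings are hypothetical
stand-ins, and the real NetworkClient works with node objects rather than
strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; none of these names exist in the Kafka client.
final class BootstrapFallback {
    // Bootstrap addresses are kept for the lifetime of the client
    // instead of being dropped after the first metadata response.
    private final List<String> bootstrapAddresses;

    BootstrapFallback(List<String> bootstrapAddresses) {
        this.bootstrapAddresses = new ArrayList<>(bootstrapAddresses);
    }

    /**
     * Chooses where to send the next metadata request.
     *
     * @param leastLoadedNode  result of the normal node selection, empty if none is usable
     * @param clusterAddresses addresses learned from the latest metadata response
     */
    Optional<String> metadataTarget(Optional<String> leastLoadedNode,
                                    List<String> clusterAddresses) {
        // Normal path: a usable cluster node exists, so nothing changes.
        if (leastLoadedNode.isPresent()) {
            return leastLoadedNode;
        }
        // Fallback path: try a bootstrap address that is not already
        // one of the (unreachable) cluster nodes.
        for (String bootstrap : bootstrapAddresses) {
            if (!clusterAddresses.contains(bootstrap)) {
                return Optional.of(bootstrap);
            }
        }
        // Every bootstrap address is already a cluster node; nothing new to try.
        return Optional.empty();
    }
}
```

   With a VIP as the only bootstrap address, the second branch returns the VIP
once every cached broker address has become unreachable, which lets the client
fetch fresh metadata from the recreated brokers.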
   
   Testing
   --------
   * Manual testing - Set up ephemeral brokers behind a VIP, then recreate all
the ephemeral brokers (so that their IPs change).
   * NetworkClient unit test - Test the metadata fetch with a bootstrap node
that is the same as the node in the cluster and with one that is different
from it.
   
   Note: This doesn't change any existing behavior; this code path is hit only
if we are unable to find any `leastLoadedNode`.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Java Client: if all ephemeral brokers fail, client can never reconnect to 
> brokers
> ---------------------------------------------------------------------------------
>
>                 Key: KAFKA-7931
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7931
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.1.0
>            Reporter: Brian
>            Priority: Critical
>
> Steps to reproduce:
>  * Set up a Kafka cluster in GKE, with the bootstrap server address configured
> to point to a load balancer that exposes all GKE nodes
>  * Run producer that emits values into a partition with 3 replicas
>  * Kill every broker in the cluster
>  * Wait for brokers to restart
> Observed result:
> The java client cannot find any of the nodes even though they have all 
> recovered. I see messages like "Connection to node 30 (/10.6.0.101:9092) 
> could not be established. Broker may not be available.".
> Note, this is *not* a duplicate of 
> https://issues.apache.org/jira/browse/KAFKA-7890. I'm using the client 
> version that contains the fix for 
> https://issues.apache.org/jira/browse/KAFKA-7890.
> Versions:
> Kafka: kafka version 2.1.0, using confluentinc/cp-kafka/5.1.0 docker image
> Client: trunk from a few days ago (git sha 
> 9f7e6b291309286e3e3c1610e98d978773c9d504), to pull in the fix for KAFKA-7890
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
