[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921709#comment-16921709 ]
ASF GitHub Bot commented on KAFKA-7931:
---------------------------------------

aravindvs commented on pull request #7288: KAFKA-7931 : [Proposal] Fix metadata fetch for ephemeral brokers behind a Virtual IP
URL: https://github.com/apache/kafka/pull/7288

If ephemeral brokers sit behind a Virtual IP and all of the brokers go down, the client cannot reconnect, as described in https://issues.apache.org/jira/browse/KAFKA-7931. The cause is that we take the bootstrap nodes and discard them once the first metadata response comes in (at which point we create a new metadata cache and a new cluster). If all the brokers then go down before the metadata is updated, the client is stuck until it is restarted.

This patch stores the list of bootstrap brokers. Instead of giving up when a 'leastLoadedNode' is not found, we use one of the bootstrap nodes to fetch the metadata. We can also make sure to use a bootstrap node only when it is not already part of the set of nodes in the cluster.

Testing
--------
* Manual Testing - Set up ephemeral brokers behind a VIP. Recreate all the ephemeral brokers (so that they change their IPs).
* NetworkClient Unit Test - Test metadata with bootstrap, both when the bootstrap node is the same as a node in the cluster and when it is different from the nodes in the cluster.

Note: This doesn't change any existing system behavior; this code path is hit only if we are unable to find any `leastLoadedNode`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Java Client: if all ephemeral brokers fail, client can never reconnect to brokers
> ---------------------------------------------------------------------------------
>
>                 Key: KAFKA-7931
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7931
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.1.0
>            Reporter: Brian
>            Priority: Critical
>
> Steps to reproduce:
>  * Set up a Kafka cluster in GKE, with the bootstrap server address configured to point to a load balancer that exposes all GKE nodes
>  * Run a producer that emits values into a partition with 3 replicas
>  * Kill every broker in the cluster
>  * Wait for the brokers to restart
> Observed result:
> The Java client cannot find any of the nodes even though they have all recovered. I see messages like "Connection to node 30 (/10.6.0.101:9092) could not be established. Broker may not be available.".
> Note, this is *not* a duplicate of https://issues.apache.org/jira/browse/KAFKA-7890. I'm using the client version that contains the fix for https://issues.apache.org/jira/browse/KAFKA-7890.
> Versions:
> Kafka: kafka version 2.1.0, using confluentinc/cp-kafka/5.1.0 docker image
> Client: trunk from a few days ago (git sha 9f7e6b291309286e3e3c1610e98d978773c9d504), to pull in the fix for KAFKA-7890

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
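
For readers following the thread, the fallback proposed in pull request #7288 can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the patch: `MetadataNodeChooser`, `updateCluster`, `markUnreachable` and `nodeForMetadata` are hypothetical names, nodes are modelled as plain `host:port` strings rather than Kafka's own node type, and `leastLoadedNode` here is a simplified stand-in that returns the first reachable cluster node.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical stand-in for the client's node selection; not Kafka's NetworkClient. */
final class MetadataNodeChooser {
    // Bootstrap addresses, kept for the lifetime of the client instead of being discarded.
    private final List<String> bootstrapAddresses;
    // Addresses learned from the most recent metadata response.
    private final Set<String> clusterAddresses = new LinkedHashSet<>();
    // Cluster addresses that are currently reachable (simulated connection state).
    private final Set<String> reachable = new LinkedHashSet<>();

    MetadataNodeChooser(List<String> bootstrapAddresses) {
        this.bootstrapAddresses = new ArrayList<>(bootstrapAddresses);
    }

    /** Simulates a metadata response replacing the known cluster. */
    void updateCluster(List<String> addresses) {
        clusterAddresses.clear();
        clusterAddresses.addAll(addresses);
        reachable.clear();
        reachable.addAll(addresses);
    }

    /** Simulates a broker going away. */
    void markUnreachable(String address) {
        reachable.remove(address);
    }

    /** Simplified analogue of leastLoadedNode: the first reachable cluster node, if any. */
    Optional<String> leastLoadedNode() {
        return clusterAddresses.stream().filter(reachable::contains).findFirst();
    }

    /** Picks the node to ask for metadata, falling back to a bootstrap address. */
    Optional<String> nodeForMetadata() {
        Optional<String> node = leastLoadedNode();
        if (node.isPresent()) {
            return node;
        }
        // Fallback path: only reached when no cluster node is usable. Bootstrap
        // addresses that are already part of the cluster are skipped, since they
        // were just found to be unusable.
        return bootstrapAddresses.stream()
                .filter(address -> !clusterAddresses.contains(address))
                .findFirst();
    }
}
```

With a hypothetical bootstrap address such as `vip.example:9092` and a cluster of `10.6.0.101:9092` and `10.6.0.102:9092`, `nodeForMetadata()` returns a cluster node while any is reachable and returns the VIP address once both have been marked unreachable. The normal path is untouched, which matches the note in the comment that the new code is hit only when no `leastLoadedNode` is found.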