[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919614#comment-16919614 ]
Aravind Velamur Srinivasan commented on KAFKA-7931: --------------------------------------------------- Finally tracked this one. This seems to happen because the "bootstrap.servers" config is thrown away after the initial startup. Because of this, when all the ephemeral brokers fail, the metadata can never be resolved, as all the brokers on the 'clusterView' has changed their IPs. This just remains there forever unless the client is rebooted! This will be the case for deployments which use a VIP (Virtual IP - say, like a GCP LB-IP) to talk to the ephemeral brokers. To solve this I think we can simply cache the 'bootstrap servers' in the metadata cache and when we are unable to find the 'leastLoadedNode' for sending the metadata, we can use one of the IPs on the bootstrap.servers to fetch the metadata. Gist of the patch is something like this: {noformat} $ git diff clients/src/main/java/org/apache/kafka/clients/NetworkClient.java diff --git a/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java b/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java index cf823f5..8c51c14 100644 --- a/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java +++ b/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java @@ -668,8 +668,15 @@ public class NetworkClient implements KafkaClient { if (found != null) log.trace("Found least loaded node {}", found); - else + else { log.trace("Least loaded node selection failed to find an available node"); + // instead of giving up get one of the bootstrap nodes + List<Node> bootStrapNodes = this.metadataUpdater.fetchBootStrapNodes(); + if (bootStrapNodes != null && !bootStrapNodes.isEmpty()) { + found = bootStrapNodes.get(0); + log.info("Found bootstrap node for metadata {}", found); + } + } return found; } @@ -951,6 +958,10 @@ public class NetworkClient implements KafkaClient { return metadata.fetch().nodes(); } + @Override + public List<Node> fetchBootStrapNodes() { + return metadata.getBootStrapNodes(); + } @Override public boolean isUpdateDue(long now) { return !this.hasFetchInProgress() && this.metadata.timeToNextUpdate(now) == 0; {noformat} Of course we can do more checks here to see: 1. If the bootstrap servers and the brokers list are different - this implies we are behind an ephemeral IP (we can either add a config option or we can get this from the metadata response and check if the list of brokers is different than the bootstrap one). 2. If the bootstrap server is connected and can be reached as well. It should be since we initiate a connection there. 3. Pick a random index rather than finding the first available bootstrap server node similar to the leastLoadedNode logic if have more than one VIP on the bootstrap.servers list. I can open a formal PR as well after adding tests and doing some cleanup of the patch. > Java Client: if all ephemeral brokers fail, client can never reconnect to > brokers > --------------------------------------------------------------------------------- > > Key: KAFKA-7931 > URL: https://issues.apache.org/jira/browse/KAFKA-7931 > Project: Kafka > Issue Type: Bug > Components: clients > Affects Versions: 2.1.0 > Reporter: Brian > Priority: Critical > > Steps to reproduce: > * Setup kafka cluster in GKE, with bootstrap server address configured to > point to a load balancer that exposes all GKE nodes > * Run producer that emits values into a partition with 3 replicas > * Kill every broker in the cluster > * Wait for brokers to restart > Observed result: > The java client cannot find any of the nodes even though they have all > recovered. I see messages like "Connection to node 30 (/10.6.0.101:9092) > could not be established. Broker may not be available.". > Note, this is *not* a duplicate of > https://issues.apache.org/jira/browse/KAFKA-7890. I'm using the client > version that contains the fix for > https://issues.apache.org/jira/browse/KAFKA-7890. > Versions: > Kakfa: kafka version 2.1.0, using confluentinc/cp-kafka/5.1.0 docker image > Client: trunk from a few days ago (git sha > 9f7e6b291309286e3e3c1610e98d978773c9d504), to pull in the fix for KAFKA-7890 > -- This message was sent by Atlassian Jira (v8.3.2#803003)