Dmitry Mischenko created KAFKA-10075:
----------------------------------------

             Summary: Kafka client stucks after Kafka-cluster unavailability
                 Key: KAFKA-10075
                 URL: https://issues.apache.org/jira/browse/KAFKA-10075
             Project: Kafka
          Issue Type: Bug
          Components: clients
    Affects Versions: 2.4.0
         Environment: Kafka v2.3.1 deployed by https://strimzi.io/ to Kubernetes cluster
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
            Reporter: Dmitry Mischenko


We have hit this issue with the Kafka client several times.

What happened:

We have Kafka v2.3.1 deployed by [https://strimzi.io/] to a Kubernetes cluster (Amazon EKS).
 # The Kafka brokers became unavailable (due to a cluster upgrade) and their internal hostnames could not be resolved.

```
2020-05-28 17:19:50 WARN NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-0.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -1 rack: null)
2020-05-28 17:19:50 WARN NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-0.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -1 rack: null)
	at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:289)
	at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:538)
	at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
	at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:335)
	at java.base/java.net.InetAddress.getAllByName(Unknown Source)
	at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
	at java.base/java.net.InetAddress.getAllByName(Unknown Source)
	at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
	at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:104)
	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
	at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:444)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
	at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:843)
	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
2020-05-28 17:19:50 WARN NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-1.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -2 rack: null)
	at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:955)
	at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
	at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
```

 # After the cluster was repaired, however, kafka-admin-client could not restore the connection and, for a long time, only threw a timeout exception every 120s.

```
2020-05-28 17:21:14 INFO StreamThread:219 - stream-thread [consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1] State transition from CREATED to STARTING
2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration 'admin.retry.backoff.ms' was supplied but isn't a known config.
2020-05-28 17:21:14 INFO AppInfoParser:118 - Kafka commitId: 77a89fcf8d7fa018
2020-05-28 17:21:14 INFO AppInfoParser:117 - Kafka version: 2.4.0
2020-05-28 17:21:14 INFO KafkaConsumer:1032 - [Consumer clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1-consumer, groupId=consumer_group-101.public.user_storage] Subscribed to pattern: 'postgres_101.public.user_storage'
2020-05-28 17:21:14 INFO KafkaStreams:276 - stream-client [consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7] State transition from CREATED to REBALANCING
2020-05-28 17:21:14 INFO StreamThread:664 - stream-thread [consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1] Starting
2020-05-28 17:21:14 INFO AppInfoParser:119 - Kafka startTimeMs: 1590686474110
2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration 'schema.registry.url' was supplied but isn't a known config.
2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration 'admin.retries' was supplied but isn't a known config.
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
2020-05-28 17:23:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
2020-05-28 17:25:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call.
2020-05-28 17:27:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
2020-05-28 17:29:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
```

 # After an app restart, everything works fine.

The problem is that we can neither catch this exception to detect the problem and restart the app automatically, nor can the client self-heal in this situation.

Why could this happen and



--
This message was sent by Atlassian Jira
(v8.3.4#803005)