Dmitry Mischenko created KAFKA-10075:
----------------------------------------

             Summary: Kafka client gets stuck after Kafka cluster unavailability
                 Key: KAFKA-10075
                 URL: https://issues.apache.org/jira/browse/KAFKA-10075
             Project: Kafka
          Issue Type: Bug
          Components: clients
    Affects Versions: 2.4.0
         Environment: Kafka v2.3.1 deployed by https://strimzi.io/ to 
Kubernetes cluster
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

            Reporter: Dmitry Mischenko


We have hit this issue with the Kafka client several times.

What happened:

We have Kafka v2.3.1 deployed by [https://strimzi.io/] to a Kubernetes cluster
(Amazon EKS).
 # The Kafka brokers became unavailable (due to a cluster upgrade) and could not be
resolved by their internal hostnames:
```
2020-05-28 17:19:50 WARN  NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-0.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -1 rack: null)
2020-05-28 17:19:50 WARN  NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-0.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -1 rack: null)
 at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:289)
 at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
 at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:538)
 at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
 at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:335)
 at java.base/java.net.InetAddress.getAllByName(Unknown Source)
 at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
 at java.base/java.net.InetAddress.getAllByName(Unknown Source)
 at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
 at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:104)
 at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
 at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
 at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:444)
 at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
 at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:843)
 at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
2020-05-28 17:19:50 WARN  NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-1.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -2 rack: null)
 at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:955)
 at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
 at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
```
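
For reference, these are roughly the client settings we could tune around reconnects and DNS lookups after such an outage. This is only an illustrative sketch: the bootstrap address and all values below are made up, and we have not verified that any of them avoids the stuck state.

```java
import java.util.Properties;

public class ClientReconnectProps {
    // Illustrative values only; the bootstrap address is made up for this example.
    public static Properties reconnectTuning() {
        Properties props = new Properties();
        props.put("bootstrap.servers",
                "data-kafka-dev-kafka-bootstrap.data-kafka-dev.svc.cluster.local:9092");

        // Exponential reconnect backoff: keep probing the brokers after they come back.
        props.put("reconnect.backoff.ms", "1000");
        props.put("reconnect.backoff.max.ms", "10000");

        // Try every IP a hostname resolves to; broker pod IPs change during
        // a rolling cluster upgrade.
        props.put("client.dns.lookup", "use_all_dns_ips");

        // How long individual requests may wait before timing out.
        props.put("request.timeout.ms", "30000");
        return props;
    }
}
```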


 # But after the cluster was restored, the kafka-admin-client could not re-establish
its connection and kept throwing timeout exceptions every 120s for a long time:

```
2020-05-28 17:21:14 INFO StreamThread:219 - stream-thread [consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1] State transition from CREATED to STARTING
2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration 'admin.retry.backoff.ms' was supplied but isn't a known config.
2020-05-28 17:21:14 INFO AppInfoParser:118 - Kafka commitId: 77a89fcf8d7fa018
2020-05-28 17:21:14 INFO AppInfoParser:117 - Kafka version: 2.4.0
2020-05-28 17:21:14 INFO KafkaConsumer:1032 - [Consumer clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1-consumer, groupId=consumer_group-101.public.user_storage] Subscribed to pattern: 'postgres_101.public.user_storage'
2020-05-28 17:21:14 INFO KafkaStreams:276 - stream-client [consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7] State transition from CREATED to REBALANCING
2020-05-28 17:21:14 INFO StreamThread:664 - stream-thread [consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1] Starting
2020-05-28 17:21:14 INFO AppInfoParser:119 - Kafka startTimeMs: 1590686474110
2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration 'schema.registry.url' was supplied but isn't a known config.
2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration 'admin.retries' was supplied but isn't a known config.
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
2020-05-28 17:23:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
2020-05-28 17:25:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call.
2020-05-28 17:27:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
2020-05-28 17:29:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
```


 # After an app restart, everything works fine.
The problem is that we can neither catch this exception to detect the problem and
automatically restart the app, nor does the client self-heal in this situation.
Why could this happen, and how can we detect or recover from it?
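
The only workaround we see for now is a watchdog outside the Kafka API. A minimal sketch follows; the class name, the 10-minute deadline, and exiting the JVM so that Kubernetes restarts the pod are our own assumptions, not something the client provides, and we have not verified this against this exact failure.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

// Hypothetical watchdog: if the stream client has not been RUNNING for too long
// (e.g. stuck after a broker outage like the one above), exit the JVM so the
// Kubernetes pod is restarted instead of hanging forever.
public class StreamsWatchdog {
    private static final Duration MAX_NOT_RUNNING = Duration.ofMinutes(10); // arbitrary deadline

    public static KafkaStreams startWithWatchdog(Topology topology, Properties props) {
        KafkaStreams streams = new KafkaStreams(topology, props);
        AtomicReference<Instant> lastRunning = new AtomicReference<>(Instant.now());

        // Remember when we were last in a healthy state.
        streams.setStateListener((newState, oldState) -> {
            if (newState == KafkaStreams.State.RUNNING) {
                lastRunning.set(Instant.now());
            }
        });

        // Crash (and let the orchestrator restart us) if a stream thread dies.
        streams.setUncaughtExceptionHandler((thread, throwable) -> {
            System.err.println("Stream thread died: " + throwable);
            System.exit(1);
        });

        // Periodically check whether we have been out of RUNNING for too long.
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        watchdog.scheduleAtFixedRate(() -> {
            boolean stuck = streams.state() != KafkaStreams.State.RUNNING
                    && Duration.between(lastRunning.get(), Instant.now()).compareTo(MAX_NOT_RUNNING) > 0;
            if (stuck) {
                System.err.println("Streams stuck in " + streams.state() + ", exiting for restart");
                System.exit(1);
            }
        }, 1, 1, TimeUnit.MINUTES);

        streams.start();
        return streams;
    }
}
```

Ideally the client itself would recover, or surface an error we could react to, so that a watchdog like this would not be needed.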

 


