Krzysztof Piecuch created KAFKA-12513:
-----------------------------------------

             Summary: Kafka zookeeper client can't connect when the first 
zookeeper server is offline
                 Key: KAFKA-12513
                 URL: https://issues.apache.org/jira/browse/KAFKA-12513
             Project: Kafka
          Issue Type: Bug
          Components: zkclient
    Affects Versions: 2.7.0, 2.4.1, 2.3.1
         Environment: kafka_2.13-2.7.0, kernel 5.4.0-52-generic (Ubuntu), Scala 
2.13.3-400
            Reporter: Krzysztof Piecuch


The Kafka ZooKeeper client library will not connect to any ZooKeeper server in 
the "zookeeper string" when the first server listed is offline. This causes the 
cluster to crash hard, and to get the cluster back into a healthy state the 
first ZooKeeper node must be brought back online.

The crash does not always happen immediately after zk0 goes offline, because 
Kafka might still have connections established to other ZooKeeper instances. 
When such a connection gets dropped and Kafka needs to reconnect, everything 
crashes hard.
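
To confirm that only zk0 is down and the other servers are still reachable from the Kafka host, a plain TCP probe can help. This is a minimal sketch, not part of Kafka or ZooKeeper: the class name, the method name and the 2000 ms timeout are made up for illustration, and a successful TCP connect only shows that the port accepts connections, not that the ZooKeeper server is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not a Kafka or ZooKeeper class): checks whether each
// server from the zookeeper string accepts a TCP connection.
class ZkReachabilityProbe {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers unresolvable hosts, refused connections and timeouts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host names taken from the zookeeper string used in this report.
        String[] servers = {"zk0.gambit:2181", "zk1.gambit:2181", "zk2.gambit:2181"};
        for (String server : servers) {
            int colon = server.lastIndexOf(':');
            String host = server.substring(0, colon);
            int port = Integer.parseInt(server.substring(colon + 1));
            System.out.println(server + " reachable=" + canConnect(host, port, 2000));
        }
    }
}
```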

 

Demo:

This works:
{code:java}
root@kafka0:/opt/kafka/current/bin# ./kafka-topics.sh --zookeeper zk0.gambit:2181/hex8c,zk1.gambit:2181/hex8c,zk2.gambit:2181/hex8c --describe --topic duma
Topic: duma     PartitionCount: 6       ReplicationFactor: 3    Configs: compression.type=uncompressed,retention.bytes=322122547200
        Topic: duma     Partition: 0    Leader: 1       Replicas: 1,0,2 Isr: 1,0,2
        Topic: duma     Partition: 1    Leader: 2       Replicas: 2,1,0 Isr: 0,1,2
        Topic: duma     Partition: 2    Leader: 0       Replicas: 0,2,1 Isr: 0,1,2
        Topic: duma     Partition: 3    Leader: 1       Replicas: 1,2,0 Isr: 1,0,2
        Topic: duma     Partition: 4    Leader: 2       Replicas: 2,0,1 Isr: 1,0,2
        Topic: duma     Partition: 5    Leader: 0       Replicas: 0,1,2 Isr: 0,1,2
{code}
Now let's mess with the zookeeper string and see how the ZooKeeper client 
reacts.

Changing the last server in the zookeeper string works as expected: 
{{kafka-topics.sh}} connected to ZooKeeper but couldn't find the topic (because 
of the bogus zookeeper string):
{code:java}
root@kafka0:/opt/kafka/current/bin# ./kafka-topics.sh --zookeeper zk0.gambit:2181/hex8c,zk1.gambit:2181/hex8c,1.1.1.1:2181/hex8c --describe --topic duma
Error while executing topic command : Topic 'duma' does not exist as expected
[2021-03-20 23:01:45,535] ERROR java.lang.IllegalArgumentException: Topic 'duma' does not exist as expected
        at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:484)
        at kafka.admin.TopicCommand$ZookeeperTopicService.describeTopic(TopicCommand.scala:390)
        at kafka.admin.TopicCommand$.main(TopicCommand.scala:67)
        at kafka.admin.TopicCommand.main(TopicCommand.scala)
 (kafka.admin.TopicCommand$)
{code}
However, when the first server in the zookeeper string is unavailable, the 
ZooKeeper client won't connect to any of the ZooKeeper servers:
{code:java}
root@kafka0:/opt/kafka/current/bin# ./kafka-topics.sh --zookeeper 1.1.1.1:2181/hex8c,zk1.gambit:2181/hex8c,zk2.gambit:2181/hex8c --describe --topic duma
[2021-03-20 23:02:43,888] WARN Client session timed out, have not heard from server in 30012ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
Exception in thread "main" kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
        at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:259)
        at kafka.zookeeper.ZooKeeperClient$$Lambda$31.000000005D399170.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253)
        at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:255)
        at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:113)
        at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1858)
        at kafka.admin.TopicCommand$ZookeeperTopicService$.apply(TopicCommand.scala:321)
        at kafka.admin.TopicCommand$.main(TopicCommand.scala:54)
        at kafka.admin.TopicCommand.main(TopicCommand.scala)
{code}
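
One thing that may be worth ruling out is how the zookeeper string itself gets split into servers and chroot. The sketch below is not the ZooKeeper client's code. It is a minimal imitation, assuming the client treats everything from the first '/' onward as the chroot path and only the part before it as the server list; that is my understanding of ZooKeeper's connect-string handling and has not been verified against this setup. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical imitation of connect-string splitting. ASSUMPTION: everything
// from the first '/' onward is the chroot; only the text before it is servers.
class ConnectStringSketch {

    // Result of splitting a connection string.
    static final class Parsed {
        final List<String> servers;
        final String chroot; // null when the string contains no '/'

        Parsed(List<String> servers, String chroot) {
            this.servers = servers;
            this.chroot = chroot;
        }
    }

    static Parsed split(String connectString) {
        String hostPart = connectString;
        String chroot = null;
        int slash = connectString.indexOf('/');
        if (slash >= 0) {
            // Assumed rule: the chroot starts at the first slash.
            chroot = connectString.substring(slash);
            hostPart = connectString.substring(0, slash);
        }
        List<String> servers = new ArrayList<>();
        for (String server : hostPart.split(",")) {
            String trimmed = server.trim();
            if (!trimmed.isEmpty()) {
                servers.add(trimmed);
            }
        }
        return new Parsed(servers, chroot);
    }

    public static void main(String[] args) {
        // The three zookeeper strings used in the demo above.
        String[] inputs = {
            "zk0.gambit:2181/hex8c,zk1.gambit:2181/hex8c,zk2.gambit:2181/hex8c",
            "zk0.gambit:2181/hex8c,zk1.gambit:2181/hex8c,1.1.1.1:2181/hex8c",
            "1.1.1.1:2181/hex8c,zk1.gambit:2181/hex8c,zk2.gambit:2181/hex8c"
        };
        for (String input : inputs) {
            Parsed parsed = split(input);
            System.out.println("servers=" + parsed.servers + " chroot=" + parsed.chroot);
        }
    }
}
```

If that assumption holds, each of the three strings above contains exactly one server (the first entry), and the remainder ends up in the chroot. That would be consistent with both results: changing the last entry changes the chroot, so the topic is not found, and an unreachable first entry leaves no other server to try. The form with the chroot given once at the end, e.g. zk0.gambit:2181,zk1.gambit:2181,zk2.gambit:2181/hex8c, would be the one to compare against.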



