[ 
https://issues.apache.org/jira/browse/KAFKA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443781#comment-17443781
 ] 

Edoardo Comar commented on KAFKA-13407:
---------------------------------------

[~vinsonZhang] are you suggesting loss of connection with ZK should not cause a 
controller to resign? 

> Kafka controller out of service after ZK leader restart
> -------------------------------------------------------
>
>                 Key: KAFKA-13407
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13407
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.8.1
>         Environment: Ubuntu 20.04
>            Reporter: Daniel
>            Priority: Critical
>
> When the Zookeeper leader disappears, a new instance becomes the leader, the 
> instances need to reconnect to Zookeeper, but the Kafka "Controller" gets 
> lost in limbo state after re-establishing connection.
> See below for how I manage to reproduce this over and over.
> *Prerequisites*
> Have a Kafka cluster with 3 instances running version 2.8.1. Figure out which 
> one is the Controller. I'm using Kafkacat 1.5.0 and get this info using the 
> `-L` flag.
> Zookeeper runs with 3 instances on version 3.5.9. Figure out which one is 
> leader by checking
>  
> {code:java}
> echo stat | nc -v localhost 2181
> {code}
>  
>  
> *Reproduce*
> 1. Stop the leader Zookeeper service.
> 2. Watch the logs of the Kafka Controller and ensure that it reconnects and 
> registers again.
>  
> {code:java}
> Oct 27 09:13:08 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:08,882] INFO 
> Unable to read additional data from server sessionid 0x1f2a12870003, likely 
> server has closed socket, closing socket connection and attempting reconnect 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] WARN 
> SASL configuration failed: javax.security.auth.login.LoginException: No JAAS 
> configuration section named 'Client' was found in specified JAAS 
> configuration file: '/opt/kafka/config/kafka_server_jaas.conf'. Will continue 
> connection to Zookeeper server without SASL authentication, if Zookeeper 
> server allows it. (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] INFO 
> Opening socket connection to server 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] ERROR 
> [ZooKeeperClient Kafka server] Auth failed. (kafka.zookeeper.ZooKeeperClient)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,549] INFO 
> Socket connection established, initiating session, client: 
> /10.10.85.215:39338, server: 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,569] INFO 
> Session establishment complete on server 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181, 
> sessionid = 0x1f2a12870003, negotiated timeout = 18000 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,548] INFO 
> [ZooKeeperClient Kafka server] Reinitializing due to auth failure. 
> (kafka.zookeeper.ZooKeeperClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO 
> [PartitionStateMachine controllerId=1003] Stopped partition state machine 
> (kafka.controller.ZkPartitionStateMachine)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO 
> [ReplicaStateMachine controllerId=1003] Stopped replica state machine 
> (kafka.controller.ZkReplicaStateMachine)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO 
> [RequestSendThread controllerId=1003] Shutting down 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO 
> [RequestSendThread controllerId=1003] Stopped 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO 
> [RequestSendThread controllerId=1003] Shutdown completed 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO 
> [RequestSendThread controllerId=1003] Shutting down 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO 
> [RequestSendThread controllerId=1003] Stopped 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO 
> [RequestSendThread controllerId=1003] Shutdown completed 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO 
> [RequestSendThread controllerId=1003] Shutting down 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO 
> [RequestSendThread controllerId=1003] Stopped 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO 
> [RequestSendThread controllerId=1003] Shutdown completed 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,556] INFO 
> Processing notification(s) to /config/changes 
> (kafka.common.ZkNodeChangeNotificationListener)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO 
> [Controller id=1003] Resigned (kafka.controller.KafkaController)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO 
> Creating /brokers/ids/1003 (is it secure? false) (kafka.zk.KafkaZkClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,560] INFO 
> Updated cache from existing FinalizedFeaturesAndEpoch(features=Features{}, 
> epoch=1) to latest FinalizedFeaturesAndEpoch(features=Features{}, epoch=1). 
> (kafka.server.FinalizedFeatureCache)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO 
> Stat of the created znode at /brokers/ids/1003 is: 
> 128849019044,128849019044,1635323000996,1635323000996,1,0,0,34265559924739,389,0,128849019044
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]:  (kafka.zk.KafkaZkClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO 
> Registered broker 1003 at path /brokers/ids/1003 with addresses: 
> PLAINTEXT://kafka-tr-2.node.consul.lab.aws.blue.example.net:9093,SASL_SSL://kafka-tr-2.node.consul.lab.aws.blue.example.net:9092,
>  czxid (broker epoch): 128849019044 (kafka.zk.KafkaZkClient)
> {code}
>  
>  
> Now it is in this limbo state but to notice the problem create a topic for 
> example.
>  
> {code:java}
> ./kafka-topics.sh --zookeeper 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr 
> --create --topic danols-ts-4 --partitions 10 --replication-factor 3
> Created topic danols-ts-4.
> {code}
>  
>  
> Then when describing the topic it looks like this:
>  
> {code:java}
> ./kafka-topics.sh --zookeeper 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr 
> --describe --topic danols-ts-4
> Topic: danols-ts-4    PartitionCount: 10      ReplicationFactor: 3    
> Configs: 
>       Topic: danols-ts-4      Partition: 0    Leader: none    Replicas: 
> 1003,1001,1002        Isr: 
>       Topic: danols-ts-4      Partition: 1    Leader: none    Replicas: 
> 1001,1002,1003        Isr: 
>       Topic: danols-ts-4      Partition: 2    Leader: none    Replicas: 
> 1002,1003,1001        Isr: 
>       Topic: danols-ts-4      Partition: 3    Leader: none    Replicas: 
> 1003,1002,1001        Isr: 
>       Topic: danols-ts-4      Partition: 4    Leader: none    Replicas: 
> 1001,1003,1002        Isr: 
>       Topic: danols-ts-4      Partition: 5    Leader: none    Replicas: 
> 1002,1001,1003        Isr: 
>       Topic: danols-ts-4      Partition: 6    Leader: none    Replicas: 
> 1003,1001,1002        Isr: 
>       Topic: danols-ts-4      Partition: 7    Leader: none    Replicas: 
> 1001,1002,1003        Isr: 
>       Topic: danols-ts-4      Partition: 8    Leader: none    Replicas: 
> 1002,1003,1001        Isr: 
>       Topic: danols-ts-4      Partition: 9    Leader: none    Replicas: 
> 1003,1002,1001        Isr: 
> {code}
>  
>  
> If I describe using the --bootstrap-server flag instead it doesn't exist
> {code:java}
> ./kafka-topics.sh --bootstrap-server localhost:9093 --describe --topic 
> danols-ts-4
> Error while executing topic command : Topic 'danols-ts-4' does not exist as 
> expected
> [2021-10-27 09:24:15,691] ERROR java.lang.IllegalArgumentException: Topic 
> 'danols-ts-4' does not exist as expected
>       at 
> kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:542)
>       at 
> kafka.admin.TopicCommand$AdminClientTopicService.describeTopic(TopicCommand.scala:317)
>       at kafka.admin.TopicCommand$.main(TopicCommand.scala:69)
>       at kafka.admin.TopicCommand.main(TopicCommand.scala)
>  (kafka.admin.TopicCommand$)
> {code}
> If I restart the Kafka service on a broker that is *not* "Controller" it 
> doesn't come up again. It just spams the following in the log for all 
> different partitions and it's not possible to connect to it.
> {code:java}
> Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: 
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
> factor: 3 larger than available brokers: 0.
> Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:48,984] INFO 
> [Admin Manager on Broker 1001]: Error processing create topic request 
> CreatableTopic(name='example-message-consumer', numPartitions=10, 
> replicationFactor=3, assignments=[], configs=[]) (kafka.server.ZkAdminManager)
> Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: 
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
> factor: 3 larger than available brokers: 0.
> Oct 27 09:27:49 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:49,021] INFO 
> [Admin Manager on Broker 1001]: Error processing create topic request 
> CreatableTopic(name='__consumer_offsets', numPartitions=50, 
> replicationFactor=3, assignments=[], 
> configs=[CreateableTopicConfig(name='compression.type', value='producer'), 
> CreateableTopicConfig(name='cleanup.policy', value='compact'), 
> CreateableTopicConfig(name='segment.bytes', value='104857600')]) 
> (kafka.server.ZkAdminManager)
> {code}
> To get out of this state I simply restart the Kafka service on the 
> "Controller". It will make another become "Controller" and all brokers 
> becomes OK again and the newly created topic gets all expected ISRs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to