Uwe Eisele created KAFKA-6714:
---------------------------------

Summary: KafkaController marks all Brokers as "Shutting down", though only one broker has been shut down
Key: KAFKA-6714
URL: https://issues.apache.org/jira/browse/KAFKA-6714
Project: Kafka
Issue Type: Bug
Components: controller, core
Affects Versions: 0.11.0.2
Environment: Kafka Cluster on Amazon AWS EC2 r4.2xlarge instances with 5 nodes and a Zookeeper Cluster on r4.2xlarge instances with 3 nodes. The Cluster is distributed across 2 availability zones.
Reporter: Uwe Eisele

In our Kafka Cluster we experienced a situation in which the Kafka controller had all Brokers marked as "Shutting down", although only one Broker had actually been shut down. The last log entry about broker state before the entry listing all brokers as shutting down reports that no brokers are shutting down. As a consequence of this state, the Kafka controller is not able to elect any partition leader.

{code:java}
[2018-03-15 16:28:24,288] INFO [Controller 5]: Shutting down broker 5 (kafka.controller.KafkaController)
[2018-03-15 16:28:24,288] DEBUG [Controller 5]: All shutting down brokers: 5 (kafka.controller.KafkaController)
[2018-03-15 16:28:24,288] DEBUG [Controller 5]: Live brokers: 1,2,3,4 (kafka.controller.KafkaController)
...
[2018-03-15 16:28:36,846] INFO [Controller 3]: Currently active brokers in the cluster: Set(1, 2, 3, 4) (kafka.controller.KafkaController)
[2018-03-15 16:28:36,846] INFO [Controller 3]: Currently shutting brokers in the cluster: Set() (kafka.controller.KafkaController)
...
[2018-03-19 17:57:22,273] INFO [Controller 3]: Shutting down broker 1 (kafka.controller.KafkaController)
[2018-03-19 17:57:22,273] DEBUG [Controller 3]: All shutting down brokers: 1,5,2,3,4 (kafka.controller.KafkaController)
[2018-03-19 17:57:22,273] DEBUG [Controller 3]: Live brokers: (kafka.controller.KafkaController)
...
[2018-03-19 17:57:22,275] ERROR Controller 3 epoch 83 encountered error while electing leader for partition [zughaltphase_v3_intern_intern_partitioned_by_evanummer,6] due to: No other replicas in ISR 1,3,5 for [zughaltphase_v3_intern_intern_partitioned_by_evanummer,6] besides shutting down brokers 1,5,2,3,4. (state.change.logger)
{code}

The question is why the Kafka controller assumes that all brokers are shutting down. The only place we found in the Kafka code (0.11.0.2) where the set of shutting down brokers is changed is line 1407 of the class _kafka.controller.KafkaController_, in the method _doControlledShutdown_:

{code:java}
info("Shutting down broker " + id)

if (!controllerContext.liveOrShuttingDownBrokerIds.contains(id))
  throw new BrokerNotAvailableException("Broker id %d does not exist.".format(id))

controllerContext.shuttingDownBrokerIds.add(id)
{code}

However, if this code path had added every broker to the set, the log file would contain a "Shutting down broker n" entry for each broker, and it does not.
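
To illustrate why this state blocks every leader election, here is a minimal Scala sketch. It is a simplified model and not the Kafka source: `ShuttingDownSketch` and `electLeader` are hypothetical names, and the filtering rule is only an assumption inferred from the error message in the log above ("No other replicas in ISR ... besides shutting down brokers ..."), namely that ISR members contained in the shutting-down set are skipped as leader candidates.

```scala
// Simplified model of the state described in this report; NOT the actual Kafka code.
// ShuttingDownSketch and electLeader are hypothetical names used for illustration only.
object ShuttingDownSketch {

  // Assumed rule: pick the first ISR member that is not marked as shutting down.
  // Returns None when every ISR member is in the shutting-down set.
  def electLeader(isr: Seq[Int], shuttingDown: Set[Int]): Option[Int] =
    isr.find(id => !shuttingDown.contains(id))

  def main(args: Array[String]): Unit = {
    val isr = Seq(1, 3, 5) // ISR taken from the error message in the log

    // Expected state: only broker 1 is shutting down, so another replica is eligible.
    println(electLeader(isr, Set(1)))

    // Observed state: all five brokers are marked as shutting down, so nothing is eligible.
    println(electLeader(isr, Set(1, 5, 2, 3, 4)))
  }
}
```

Under this assumption, a single stale or wrongly populated shutting-down set is enough to make election fail for every partition, which matches the behaviour we observed.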