[ https://issues.apache.org/jira/browse/KAFKA-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Eisele updated KAFKA-6715:
------------------------------
Attachment: 20180319-1756_kafka01-jvm-stack.dump
> Leader transition for all partitions led by two brokers without visible
> reason
> -------------------------------------------------------------------------------
>
> Key: KAFKA-6715
> URL: https://issues.apache.org/jira/browse/KAFKA-6715
> Project: Kafka
> Issue Type: Bug
> Components: core, replication
> Affects Versions: 0.11.0.2
> Environment: Kafka cluster on Amazon AWS EC2 r4.2xlarge instances
> with 5 nodes and a Zookeeper cluster on r4.2xlarge instances with 3 nodes.
> The cluster is distributed across 2 availability zones.
> Reporter: Uwe Eisele
> Priority: Critical
> Attachments: 20180319-1756_kafka01-jvm-stack.dump
>
>
> In our cluster we experienced a situation in which leadership of all
> partitions led by two brokers was moved, mostly to one other broker.
> We don't know why this happened. At that time there was no broker outage, nor
> had a broker shutdown been initiated. The Zookeeper nodes of the affected
> brokers (/brokers/ids/3, /brokers/ids/4) were not modified during this
> time.
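> For reference, that check can be sketched with the plain Zookeeper Java client
> by reading the timestamps of the registration znodes (the connection string and
> session timeout below are illustrative, not our production values):
> {code:java}
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> public class CheckBrokerZnodes {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("zookeeper01:2181", 30000, event -> { });
>         for (String path : new String[] { "/brokers/ids/3", "/brokers/ids/4" }) {
>             Stat stat = zk.exists(path, false);
>             if (stat == null) {
>                 System.out.println(path + " does not exist (broker not registered)");
>             } else {
>                 // ctime/mtime show when the ephemeral registration znode was
>                 // created and last modified; a re-registration would change them.
>                 System.out.printf("%s ctime=%d mtime=%d%n",
>                         path, stat.getCtime(), stat.getMtime());
>             }
>         }
>         zk.close();
>     }
> }
> {code}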
> In addition, there are no logs that would indicate a leader transition for the
> affected brokers. We would expect to see a "{{sending become-leader
> LeaderAndIsr request}}" in the controller log for each partition, as well as a
> "{{completed LeaderAndIsr request}}" in the state change log of each Kafka
> broker that becomes a new leader or follower. Our log level for
> kafka.controller and the state change log is set to TRACE.
> Though all brokers are running, the situation does not recover. The cluster
> remains stuck in a highly imbalanced leader distribution, in which two brokers
> are not the leader of any partition, and one broker is the leader of almost
> all partitions.
> {code:java}
> kafka-controller Log (Level TRACE):
> [2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for
> broker 5 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for
> broker 1 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for
> broker 2 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for
> broker 3 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for
> broker 4 is 0.0 (kafka.controller.KafkaController)
> ...
> [2018-03-19 17:08:54,049] TRACE [Controller 3]: Leader imbalance ratio for
> broker 5 is 0.8054794520547945 (kafka.controller.KafkaController)
> [2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for
> broker 1 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for
> broker 2 is 0.4807692307692308 (kafka.controller.KafkaController)
> [2018-03-19 17:08:54,051] TRACE [Controller 3]: Leader imbalance ratio for
> broker 3 is 1.0 (kafka.controller.KafkaController)
> [2018-03-19 17:08:54,053] TRACE [Controller 3]: Leader imbalance ratio for
> broker 4 is 1.0 (kafka.controller.KafkaController)
> ...
> [2018-03-19 17:23:54,080] TRACE [Controller 3]: Leader imbalance ratio for
> broker 5 is 0.8054794520547945 (kafka.controller.KafkaController)
> [2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for
> broker 1 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for
> broker 2 is 0.4807692307692308 (kafka.controller.KafkaController)
> [2018-03-19 17:23:54,082] TRACE [Controller 3]: Leader imbalance ratio for
> broker 3 is 1.0 (kafka.controller.KafkaController)
> [2018-03-19 17:23:54,084] TRACE [Controller 3]: Leader imbalance ratio for
> broker 4 is 1.0 (kafka.controller.KafkaController)
> {code}
> The imbalance was recognized by the controller, but nothing happened.
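> As far as we understand the 0.11 controller, this ratio is computed per broker
> as the number of partitions for which the broker is the preferred leader but
> not the current leader, divided by the number of partitions for which it is
> the preferred leader. A ratio of 1.0 for brokers 3 and 4 therefore means they
> lead none of their preferred partitions. A minimal sketch of that computation
> (the PartitionInfo type and its fields are ours, not Kafka's classes):
> {code:java}
> import java.util.List;
>
> public class ImbalanceRatio {
>
>     static final class PartitionInfo {
>         final int preferredLeader; // first replica in the assignment
>         final int currentLeader;   // current leader from the partition state
>
>         PartitionInfo(int preferredLeader, int currentLeader) {
>             this.preferredLeader = preferredLeader;
>             this.currentLeader = currentLeader;
>         }
>     }
>
>     static double imbalanceRatio(int brokerId, List<PartitionInfo> partitions) {
>         long preferredHere = 0; // partitions this broker should lead
>         long notLedHere = 0;    // ...but which are currently led elsewhere
>         for (PartitionInfo p : partitions) {
>             if (p.preferredLeader == brokerId) {
>                 preferredHere++;
>                 if (p.currentLeader != brokerId) {
>                     notLedHere++;
>                 }
>             }
>         }
>         return preferredHere == 0 ? 0.0 : (double) notLedHere / preferredHere;
>     }
> }
> {code}
> With auto.leader.rebalance.enable=true, a preferred replica election should be
> triggered once this ratio exceeds leader.imbalance.per.broker.percentage
> (default 10%), which is clearly the case here.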
> In addition, it seems that the ReplicaFetcherThreads died without any log
> message, though we would not have thought this possible... However, we would
> expect log messages stating that the fetchers for the affected partitions have
> been removed, as well as that the ReplicaFetcherThreads are shutting down. The
> log level for _kafka_ is set to INFO. In other situations, e.g. when a broker
> is shut down, we do see such entries in the log files.
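> The attached 20180319-1756_kafka01-jvm-stack.dump was taken to check whether
> the fetcher threads still exist. The same check can be sketched in-process via
> the thread MXBean, since the fetcher threads are named
> ReplicaFetcherThread-<fetcherId>-<sourceBrokerId> (this would have to run
> inside the broker JVM, e.g. via an agent; as a standalone main it only
> inspects its own JVM):
> {code:java}
> import java.lang.management.ManagementFactory;
> import java.lang.management.ThreadInfo;
> import java.lang.management.ThreadMXBean;
>
> public class ListReplicaFetchers {
>     public static void main(String[] args) {
>         ThreadMXBean mx = ManagementFactory.getThreadMXBean();
>         // Dump all live threads and keep those that look like replica fetchers.
>         for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
>             if (info.getThreadName().startsWith("ReplicaFetcherThread-")) {
>                 System.out.println(info.getThreadName() + " " + info.getThreadState());
>             }
>         }
>     }
> }
> {code}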
> Besides that, this caused under-replicated partitions. It seems that no broker
> fetches from the partitions with the newly assigned leaders. As with the
> highly imbalanced leader distribution, the cluster remains stuck in this state
> and does not recover.
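> The under-replicated partitions can be listed with the Java AdminClient
> shipped since 0.11; a minimal sketch (the bootstrap address is illustrative):
> {code:java}
> import java.util.Map;
> import java.util.Properties;
> import java.util.Set;
>
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.TopicDescription;
> import org.apache.kafka.common.TopicPartitionInfo;
>
> public class ListUnderReplicated {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka01:9092");
>         try (AdminClient admin = AdminClient.create(props)) {
>             Set<String> topics = admin.listTopics().names().get();
>             Map<String, TopicDescription> descriptions =
>                     admin.describeTopics(topics).all().get();
>             for (TopicDescription desc : descriptions.values()) {
>                 for (TopicPartitionInfo p : desc.partitions()) {
>                     // A partition is under-replicated when its ISR is smaller
>                     // than its full replica set.
>                     if (p.isr().size() < p.replicas().size()) {
>                         System.out.printf("%s-%d leader=%s isr=%s%n",
>                                 desc.name(), p.partition(), p.leader(), p.isr());
>                     }
>                 }
>             }
>         }
>     }
> }
> {code}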