[ 
https://issues.apache.org/jira/browse/KAFKA-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Eisele updated KAFKA-6715:
------------------------------
    Attachment: 20180319-1756_kafka01-jvm-stack.dump

> Leader transition for all partitions led by two brokers without visible 
> reason
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-6715
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6715
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, replication
>    Affects Versions: 0.11.0.2
>         Environment: Kafka cluster on Amazon AWS EC2 r4.2xlarge instances 
> with 5 nodes and a Zookeeper cluster on r4.2xlarge instances with 3 nodes. 
> The cluster is distributed across 2 availability zones.
>            Reporter: Uwe Eisele
>            Priority: Critical
>         Attachments: 20180319-1756_kafka01-jvm-stack.dump
>
>
> In our cluster we experienced a situation in which leadership of all 
> partitions led by two brokers was moved, mostly to one other broker.
> We don't know why this happened. At that time there was no broker outage, 
> and no broker shutdown had been initiated. The Zookeeper nodes of the 
> affected brokers (/brokers/ids/3, /brokers/ids/4) were not modified during 
> this time.
> In addition, there are no logs that would indicate a leader transition for 
> the affected brokers. We would expect to see a "{{sending become-leader 
> LeaderAndIsr request}}" in the controller log for each partition, as well as 
> a "{{completed LeaderAndIsr request}}" in the state change log of the Kafka 
> brokers that become the new leader and followers. Our log level for 
> kafka.controller and the state change log is set to TRACE.
> Though all brokers are running, the situation does not recover. The cluster 
> remains stuck in a highly imbalanced leader distribution, in which two 
> brokers are not the leader for any partition, and one broker is the leader 
> for almost all partitions.
> {code:java}
> kafka-controller Log (Level TRACE):
> [2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 5 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 1 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 2 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 3 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 4 is 0.0 (kafka.controller.KafkaController)
> ...
> [2018-03-19 17:08:54,049] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 5 is 0.8054794520547945 (kafka.controller.KafkaController)
> [2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 1 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 2 is 0.4807692307692308 (kafka.controller.KafkaController)
> [2018-03-19 17:08:54,051] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 3 is 1.0 (kafka.controller.KafkaController)
> [2018-03-19 17:08:54,053] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 4 is 1.0 (kafka.controller.KafkaController)
> ...
> [2018-03-19 17:23:54,080] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 5 is 0.8054794520547945 (kafka.controller.KafkaController)
> [2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 1 is 0.0 (kafka.controller.KafkaController)
> [2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 2 is 0.4807692307692308 (kafka.controller.KafkaController)
> [2018-03-19 17:23:54,082] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 3 is 1.0 (kafka.controller.KafkaController)
> [2018-03-19 17:23:54,084] TRACE [Controller 3]: Leader imbalance ratio for 
> broker 4 is 1.0 (kafka.controller.KafkaController)
> {code}
> The imbalance was recognized by the controller, but nothing happened.
> In addition, it seems that the ReplicaFetcherThreads die without any log 
> message, though we think this should not be possible. We would expect log 
> messages stating that fetchers for partitions have been removed, as well as 
> that the ReplicaFetcherThreads are shutting down. The log level for _kafka_ 
> is set to INFO. In other situations, when a broker is shut down, we see such 
> entries in the log files.
> Besides that, this caused under-replicated partitions. It seems that no 
> broker fetches from the partitions with the newly assigned leaders. As with 
> the highly imbalanced leader distribution, the cluster remains stuck in this 
> state and does not recover.
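
For anyone triaging similar controller logs, the following is a minimal sketch (not part of the original report) that extracts the "Leader imbalance ratio" entries and lists the brokers whose ratio exceeds a threshold at each check. The function name, the regular expression, and the 0.10 threshold are assumptions for illustration; the threshold mirrors the default leader.imbalance.per.broker.percentage of 10 and should be adjusted if the cluster overrides it. The sample lines are taken from the log excerpt above, joined back into single lines.

```python
import re
from collections import defaultdict

# Assumed pattern for the controller TRACE lines quoted in the report.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] TRACE \[Controller \d+\]: "
    r"Leader imbalance ratio for broker (?P<broker>\d+) is (?P<ratio>[0-9.]+)"
)

# Assumption: default leader.imbalance.per.broker.percentage (10 percent).
THRESHOLD = 0.10


def imbalanced_brokers(lines, threshold=THRESHOLD):
    """Return {check_minute: {broker_id: ratio}} for ratios above threshold."""
    result = defaultdict(dict)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an imbalance-ratio line
        ratio = float(match.group("ratio"))
        if ratio > threshold:
            # Group entries of one check run by timestamp truncated to the minute.
            minute = match.group("ts")[:16]
            result[minute][int(match.group("broker"))] = ratio
    return dict(result)


SAMPLE = [
    "[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for broker 5 is 0.0 (kafka.controller.KafkaController)",
    "[2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for broker 3 is 0.0 (kafka.controller.KafkaController)",
    "[2018-03-19 17:08:54,049] TRACE [Controller 3]: Leader imbalance ratio for broker 5 is 0.8054794520547945 (kafka.controller.KafkaController)",
    "[2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for broker 1 is 0.0 (kafka.controller.KafkaController)",
    "[2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for broker 2 is 0.4807692307692308 (kafka.controller.KafkaController)",
    "[2018-03-19 17:08:54,051] TRACE [Controller 3]: Leader imbalance ratio for broker 3 is 1.0 (kafka.controller.KafkaController)",
    "[2018-03-19 17:08:54,053] TRACE [Controller 3]: Leader imbalance ratio for broker 4 is 1.0 (kafka.controller.KafkaController)",
]

if __name__ == "__main__":
    for minute, brokers in imbalanced_brokers(SAMPLE).items():
        print(minute, brokers)
```

With default settings, a ratio above the threshold would normally be expected to lead to a preferred replica election at the next check, so a ratio that stays at 1.0 across several checks, as for brokers 3 and 4 here, is consistent with the stuck state described in the report.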



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
