[ https://issues.apache.org/jira/browse/KAFKA-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Eisele updated KAFKA-6715:
------------------------------
    Description: 
In our cluster we experienced a situation in which leadership for all 
partitions led by two brokers was moved, mostly to a single other broker.

We don't know why this happened. At that time there was no broker outage, nor 
had a broker shutdown been initiated. The Zookeeper nodes of the affected 
brokers (/brokers/ids/3, /brokers/ids/4) were not modified during this time.

In addition, there are no logs that would indicate a leader transition for the 
affected brokers. We would expect to see a "{{sending become-leader 
LeaderAndIsr request}}" entry in the controller log for each partition, as 
well as a "{{completed LeaderAndIsr request}}" entry in the state change log 
of each Kafka broker that becomes a new leader or follower. Our log level for 
kafka.controller and the state change log is set to TRACE.
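
For reference, this is how we scanned the logs for those markers: a minimal 
sketch, where the exact log phrasing is an assumption based on what we observe 
in our 0.11.0.2 logs:

```python
import re

# Leader-transition markers we expect in the controller and state change
# logs (wording as observed in our logs; treat it as an assumption).
BECOME_LEADER = re.compile(r"sending become-leader LeaderAndIsr request")
COMPLETED = re.compile(r"completed LeaderAndIsr request")

def count_transition_markers(lines):
    """Count become-leader and completed LeaderAndIsr entries in log lines."""
    become = sum(1 for line in lines if BECOME_LEADER.search(line))
    completed = sum(1 for line in lines if COMPLETED.search(line))
    return become, completed
```

Both counts were zero for the time window in question.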

Though all brokers are running, the situation does not recover. The cluster is 
stuck with a highly imbalanced leader distribution, in which two brokers are 
not the leader for any partition and one broker is the leader for almost all 
partitions.
{code:java}
kafka-controller Log (Level TRACE):
[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for 
broker 5 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for 
broker 1 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for 
broker 2 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for 
broker 3 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for 
broker 4 is 0.0 (kafka.controller.KafkaController)
...
[2018-03-19 17:08:54,049] TRACE [Controller 3]: Leader imbalance ratio for 
broker 5 is 0.8054794520547945 (kafka.controller.KafkaController)
[2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for 
broker 1 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for 
broker 2 is 0.4807692307692308 (kafka.controller.KafkaController)
[2018-03-19 17:08:54,051] TRACE [Controller 3]: Leader imbalance ratio for 
broker 3 is 1.0 (kafka.controller.KafkaController)
[2018-03-19 17:08:54,053] TRACE [Controller 3]: Leader imbalance ratio for 
broker 4 is 1.0 (kafka.controller.KafkaController)
...
[2018-03-19 17:23:54,080] TRACE [Controller 3]: Leader imbalance ratio for 
broker 5 is 0.8054794520547945 (kafka.controller.KafkaController)
[2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for 
broker 1 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for 
broker 2 is 0.4807692307692308 (kafka.controller.KafkaController)
[2018-03-19 17:23:54,082] TRACE [Controller 3]: Leader imbalance ratio for 
broker 3 is 1.0 (kafka.controller.KafkaController)
[2018-03-19 17:23:54,084] TRACE [Controller 3]: Leader imbalance ratio for 
broker 4 is 1.0 (kafka.controller.KafkaController)
{code}
The imbalance was recognized by the controller, but nothing happened.
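
For context, the logged ratio can be modeled as follows (a simplified sketch, 
not the actual controller code): for each broker, the fraction of partitions 
whose preferred replica (the first entry of the assigned replica list) is that 
broker, but whose current leader is a different broker.

```python
def imbalance_ratio(assignment, leader, broker_id):
    """Fraction of partitions that prefer `broker_id` as leader (first entry
    of the assigned replica list) but are currently led by another broker.

    assignment: partition -> ordered list of assigned replica ids
    leader:     partition -> current leader broker id
    """
    preferred = [p for p, replicas in assignment.items()
                 if replicas[0] == broker_id]
    if not preferred:
        return 0.0
    displaced = [p for p in preferred if leader[p] != broker_id]
    return len(displaced) / len(preferred)
```

A ratio of 1.0 for brokers 3 and 4 means every partition that prefers them is 
led elsewhere. With auto.leader.rebalance.enable=true, we would expect the 
controller to trigger a preferred replica election once the ratio exceeds 
leader.imbalance.per.broker.percentage (default 10), checked every 
leader.imbalance.check.interval.seconds (default 300, which matches the 
5-minute intervals in the log above).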

In addition, it seems that the ReplicaFetcherThreads died without any log 
message (see the attached stack trace), though we think this should not be 
possible. We would expect log messages stating that fetchers for partitions 
have been removed, as well as that the ReplicaFetcherThreads are shutting 
down. The log level for _kafka_ is set to INFO, and in other situations, such 
as a broker shutdown, we do see such entries in the log files.

Besides that, this caused under-replicated partitions. It seems that no broker 
fetches from the partitions with the newly assigned leaders. As with the 
highly imbalanced leader distribution, the cluster is stuck in this state and 
does not recover.

This is a recurring problem; however, we cannot reproduce it.

> Leader transition for all partitions led by two brokers without visible 
> reason
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-6715
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6715
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, replication
>    Affects Versions: 0.11.0.2
>         Environment: Kafka cluster on Amazon AWS EC2 r4.2xlarge instances 
> with 5 nodes and a Zookeeper cluster on r4.2xlarge instances with 3 nodes. 
> The cluster is distributed across 2 availability zones.
>            Reporter: Uwe Eisele
>            Priority: Critical
>         Attachments: 20180319-1756_kafka01-jvm-stack.dump
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
