[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029638#comment-17029638 ]
Evan Williams commented on KAFKA-4084:
--------------------------------------

[~sql_consulting] Many thanks for the explanation. Is there any logic I can implement now to work around this in the meantime, until KIP-491 is merged? We are currently on Confluent 5.4 (open source).

My only thought was to programmatically set auto.leader.rebalance.enable=false, so that a freshly restarted broker at least doesn't become leader and serve data while it is in a diminished state under too much load (causing streams clients to fail/crash), and then at the appropriate time switch it back to true and wait for it to become leader again.

What we are seeing, almost regardless of num.io.threads/num.network.threads/num.replica.fetchers (e.g. set to 1/3/1), is that an empty broker will go straight to 100% CPU due to ReplicaFetcher threads, even when it is not the leader of any partitions. (I have a feeling that if replica.fetch.response.max.bytes is not tuned correctly, this may have some effect?)

Could bandwidth quotas at the broker level help in that situation?

leader.replication.throttled.rate
follower.replication.throttled.rate

Or, for those to have any effect, do the bandwidth quotas also need to be set at the topic level? (See the example commands at the end of this message.)

leader.replication.throttled.replicas
follower.replication.throttled.replicas

> automated leader rebalance causes replication downtime for clusters with too
> many partitions
> --------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4084
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Tom Crayford
>            Priority: Major
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and
> you have a cluster with many partitions, there is a severe amount of
> replication downtime following a restart. This causes
> `UnderReplicatedPartitions` to fire, and replication is paused.
>
> This is because the current automated leader rebalance mechanism changes
> leaders for *all* imbalanced partitions at once, instead of doing it
> gradually. This effectively stops all replica fetchers in the cluster
> (assuming there are enough imbalanced partitions), and restarts them. This
> can take minutes on busy clusters, during which no replication is happening
> and user data is at risk. Clients with {{acks=-1}} also see issues at this
> time, because replication is effectively stalled.
>
> To quote Todd Palino from the mailing list:
>
> bq. There is an admin CLI command to trigger the preferred replica election
> manually. There is also a broker configuration "auto.leader.rebalance.enable"
> which you can set to have the broker automatically perform the PLE when
> needed. DO NOT USE THIS OPTION. There are serious performance issues when
> doing so, especially on larger clusters. It needs some development work that
> has not been fully identified yet.
>
> This setting is extremely useful for smaller clusters, but with high
> partition counts causes the huge issues stated above.
>
> One potential fix could be adding a new configuration for the number of
> partitions to do automated leader rebalancing for at once, and *stop* once
> that number of leader rebalances are in flight, until they're done. There may
> be better mechanisms, and I'd love to hear if anybody has any ideas.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
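On the throttle questions above: with replication quotas (KIP-73), the broker-level rate configs and the topic-level throttled-replica lists work together; the rates only apply to replicas listed in leader.replication.throttled.replicas / follower.replication.throttled.replicas, so both levels need to be set for the quota to have any effect. A minimal sketch with kafka-configs.sh follows, assuming an illustrative broker id (1), topic name (my-topic), endpoints (localhost:9092 / localhost:2181), and a 10 MB/s rate; on Kafka 2.4 (Confluent 5.4) the topic-level configs are typically still altered via --zookeeper, while newer releases also accept --bootstrap-server for topics.

  # Broker-level: cap leader/follower replication traffic (bytes/sec; 10 MB/s is illustrative)
  bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type brokers --entity-name 1 \
    --add-config 'leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760'

  # Topic-level: which replicas the rates apply to ('*' throttles all replicas of the topic)
  bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
    --entity-type topics --entity-name my-topic \
    --add-config 'leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*'

  # Remove the broker-level throttles once the restarted broker has caught up
  bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type brokers --entity-name 1 \
    --delete-config 'leader.replication.throttled.rate,follower.replication.throttled.rate'

Note that these quotas only limit replication bandwidth; they do not by themselves keep a catching-up broker from being elected leader, which is the gap KIP-491 is meant to address.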