[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077958#comment-17077958 ]
GEORGE LI commented on KAFKA-4084:
----------------------------------

[~blodsbror] Doing PLE with too many partitions at once is probably not a good idea. We have a script that takes all partitions with a preferred leader imbalance (i.e. current leader != first replica) whose first replica is in the ISR, and divides them into batches (e.g. 100 partitions per batch, with a throttle sleep of about 5-10 seconds between batches). We also verify each batch after submitting it for PLE, e.g. that the ZK node /<cluster_name>/admin/preferred_replica_election is gone (see the batching sketch at the end of this comment).

For the KIP-491 patch, maybe I should write a wrapper for doing PLE, because the logic is no longer just current_leader != first_replica, but current_leader != <preferred_replica_after_deprioritized_logic>.

The batch logic is basically: write the topic/partitions into a JSON file (e.g. 100 per batch), then submit that batch using the open source script `kafka-preferred-replica-election.sh`. Below is a shell script that does PLE for one topic (all partitions). It still uses ZK to submit the JSON; it could be changed to use --bootstrap-server.

{code}
$ cat topic_preferred_leader_election.sh
.....
name=$1
topic=$2

kafka_cluster_name="${name}"
zk=$(kafka_zk_lookup ${kafka_cluster_name})
json_filename="${name}_${topic}_leader_election.json"

# Build the JSON payload listing every partition of the topic
touch ${json_filename}
echo "{\"partitions\":[" >${json_filename}

IFS=$'\n'
for partition in `/usr/lib/kafka/bin/kafka-run-class.sh kafka.admin.TopicCommand --zookeeper $zk --describe --topic $topic 2>/dev/null | grep Partition: | awk -F "Partition:" '{print $2}' | awk '{print $1}'`
do
  # Only the first entry (partition 0) is written without a leading comma
  if [ "$partition" == "0" ]
  then
    echo " {\"topic\": \"${topic}\", \"partition\": ${partition}}" >>${json_filename}
  else
    echo ",{\"topic\": \"${topic}\", \"partition\": ${partition}}" >>${json_filename}
  fi
done
echo "]}" >>${json_filename}

# Submit the preferred leader election for this topic
/usr/lib/kafka/bin/kafka-preferred-replica-election.sh --zookeeper $zk --path-to-json-file ${json_filename} 2>/dev/null
#rm ${json_filename}
{code}

For troubleshooting the timeout, maybe check the ZK node /<cluster_name>/admin/preferred_replica_election and see whether any PLE is still pending there. Could it be because of the KIP-491 preferred leader deprioritized/black list? I doubt it, because I have tested that it works. Does this PLE work before applying the KIP-491 patch? I think a ZooKeeper node has a size limit of 1MB, so 5000-6000 partitions doing PLE all together in one batch might not work. How about trying one topic first, then 100 partitions per batch?
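To make the batching wrapper above concrete, here is a minimal sketch of it. It assumes the imbalanced partitions (current leader != preferred leader, preferred leader in ISR) have already been collected into a file with one "topic partition" pair per line, and it reuses the kafka_zk_lookup helper from the script above; the batch-file handling, the zookeeper-shell.sh check for the pending election node, and the sleep values are illustrative assumptions, not exactly what we run in production:

{code}
#!/bin/bash
# Sketch: submit PLE in small batches and wait for each batch to complete.
# Usage: batch_preferred_leader_election.sh <cluster_name> <imbalanced_partitions_file>
#   <imbalanced_partitions_file> contains one "topic partition" pair per line.
name=$1
partitions_file=$2
batch_size=100           # partitions per PLE batch
throttle_seconds=10      # throttle sleep between batches

zk=$(kafka_zk_lookup ${name})

# Split the imbalanced partition list into batch files of ${batch_size} lines each
rm -f ple_batch_*
split -l ${batch_size} ${partitions_file} ple_batch_

for batch in ple_batch_*
do
  json="${name}_${batch}.json"
  {
    echo "{\"partitions\":["
    # first entry gets a leading space, the rest a leading comma
    awk '{printf "%s{\"topic\": \"%s\", \"partition\": %s}\n", (NR==1 ? " " : ","), $1, $2}' ${batch}
    echo "]}"
  } > ${json}

  /usr/lib/kafka/bin/kafka-preferred-replica-election.sh --zookeeper $zk --path-to-json-file ${json} 2>/dev/null

  # Verify the batch: wait until the pending election znode under /admin is gone
  while echo "ls /admin" | /usr/lib/kafka/bin/zookeeper-shell.sh $zk 2>/dev/null | grep -q preferred_replica_election
  do
    sleep 1
  done

  sleep ${throttle_seconds}
  #rm ${batch} ${json}
done
{code}

The wait loop is what keeps only one batch of leader movements in flight at a time; the throttle sleep on top of that just gives the replica fetchers some breathing room between batches.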
> automated leader rebalance causes replication downtime for clusters with too many partitions
> ---------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4084
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Tom Crayford
>            Priority: Major
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and you have a cluster with many partitions, there is a severe amount of replication downtime following a restart. This causes `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes leaders for *all* imbalanced partitions at once, instead of doing it gradually. This effectively stops all replica fetchers in the cluster (assuming there are enough imbalanced partitions), and restarts them. This can take minutes on busy clusters, during which no replication is happening and user data is at risk. Clients with {{acks=-1}} also see issues at this time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election manually. There is also a broker configuration “auto.leader.rebalance.enable” which you can set to have the broker automatically perform the PLE when needed. DO NOT USE THIS OPTION. There are serious performance issues when doing so, especially on larger clusters. It needs some development work that has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of partitions to do automated leader rebalancing for at once, and *stop* once that number of leader rebalances are in flight, until they're done. There may be better mechanisms, and I'd love to hear if anybody has any ideas.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)