[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077958#comment-17077958 ]
GEORGE LI commented on KAFKA-4084:
----------------------------------

[~blodsbror] Doing PLE with too many partitions at once is probably not a good idea. We have a script that takes all partitions with a preferred leader imbalance (i.e. current leader != first replica) whose first replica is in the ISR, and divides them into batches (e.g. 100 partitions per batch, with a throttle sleep of about 5-10 seconds between batches). We also verify each batch after submitting it for PLE, e.g. that the ZK node /<cluster_name>/admin/preferred_replica_election is gone (see the batching sketch at the end of this comment).

For the KIP-491 patch, maybe I should write a wrapper for doing PLE, because the logic is no longer just current_leader != first_replica, but current_leader != <preferred_replica_after_deprioritized_logic>.

The batch logic is basically: write the topic/partitions into a JSON file (e.g. 100 per batch), then submit that batch using the open source script `kafka-preferred-replica-election.sh`. Below is a shell script that does PLE for one topic (all partitions). It still uses ZK to submit the JSON; it could be changed to use --bootstrap-server.

{code}
$ cat topic_preferred_leader_election.sh
.....
name=$1
topic=$2

kafka_cluster_name="${name}"
zk=$(kafka_zk_lookup ${kafka_cluster_name})
json_filename="${name}_${topic}_leader_election.json"

# Build the JSON payload listing every partition of the topic
touch ${json_filename}
echo "{\"partitions\":[" >${json_filename}

IFS=$'\n'
for partition in `/usr/lib/kafka/bin/kafka-run-class.sh kafka.admin.TopicCommand --zookeeper $zk --describe --topic $topic 2>/dev/null | grep Partition: | awk -F "Partition:" '{print $2}' | awk '{print $1}'`
do
  # Only the first entry (partition 0) is written without a leading comma
  if [ "$partition" == "0" ]
  then
    echo " {\"topic\": \"${topic}\", \"partition\": ${partition}}" >>${json_filename}
  else
    echo ",{\"topic\": \"${topic}\", \"partition\": ${partition}}" >>${json_filename}
  fi
done
echo "]}" >>${json_filename}

# Submit the preferred leader election for this topic
/usr/lib/kafka/bin/kafka-preferred-replica-election.sh --zookeeper $zk --path-to-json-file ${json_filename} 2>/dev/null
#rm ${json_filename}
{code}

For troubleshooting the timeout, maybe check the ZK node /<cluster_name>/admin/preferred_replica_election and see whether any PLE is still pending there. Could it be because of the KIP-491 preferred leader deprioritized/black list? I doubt it, because I have tested that it works. Does this PLE work before applying the KIP-491 patch? I think a ZooKeeper node has a size limit of 1MB, so 5000-6000 partitions doing PLE all together in one batch might not work. How about trying one topic first, then 100 partitions per batch?
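To make the batching wrapper above concrete, here is a minimal sketch of it. It assumes the imbalanced partitions (current leader != preferred leader, preferred leader in ISR) have already been collected into a file with one "topic partition" pair per line, and it reuses the kafka_zk_lookup helper from the script above; the batch-file handling, the zookeeper-shell.sh check for the pending election node, and the sleep values are illustrative assumptions, not exactly what we run in production:

{code}
#!/bin/bash
# Sketch: submit PLE in small batches and wait for each batch to complete.
# Usage: batch_preferred_leader_election.sh <cluster_name> <imbalanced_partitions_file>
#   <imbalanced_partitions_file> contains one "topic partition" pair per line.
name=$1
partitions_file=$2
batch_size=100           # partitions per PLE batch
throttle_seconds=10      # throttle sleep between batches

zk=$(kafka_zk_lookup ${name})

# Split the imbalanced partition list into batch files of ${batch_size} lines each
rm -f ple_batch_*
split -l ${batch_size} ${partitions_file} ple_batch_

for batch in ple_batch_*
do
  json="${name}_${batch}.json"
  {
    echo "{\"partitions\":["
    # first entry gets a leading space, the rest a leading comma
    awk '{printf "%s{\"topic\": \"%s\", \"partition\": %s}\n", (NR==1 ? " " : ","), $1, $2}' ${batch}
    echo "]}"
  } > ${json}

  /usr/lib/kafka/bin/kafka-preferred-replica-election.sh --zookeeper $zk --path-to-json-file ${json} 2>/dev/null

  # Verify the batch: wait until the pending election znode under /admin is gone
  while echo "ls /admin" | /usr/lib/kafka/bin/zookeeper-shell.sh $zk 2>/dev/null | grep -q preferred_replica_election
  do
    sleep 1
  done

  sleep ${throttle_seconds}
  #rm ${batch} ${json}
done
{code}

The wait loop is what keeps only one batch of leader movements in flight at a time; the throttle sleep on top of that just gives the replica fetchers some breathing room between batches.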
> automated leader rebalance causes replication downtime for clusters with too many partitions
> ---------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4084
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Tom Crayford
>            Priority: Major
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and you have a cluster with many partitions, there is a severe amount of replication downtime following a restart. This causes `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes leaders for *all* imbalanced partitions at once, instead of doing it gradually. This effectively stops all replica fetchers in the cluster (assuming there are enough imbalanced partitions), and restarts them. This can take minutes on busy clusters, during which no replication is happening and user data is at risk. Clients with {{acks=-1}} also see issues at this time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election manually. There is also a broker configuration “auto.leader.rebalance.enable” which you can set to have the broker automatically perform the PLE when needed. DO NOT USE THIS OPTION. There are serious performance issues when doing so, especially on larger clusters. It needs some development work that has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of partitions to do automated leader rebalancing for at once, and *stop* once that number of leader rebalances are in flight, until they're done. There may be better mechanisms, and I'd love to hear if anybody has any ideas.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)