Hi. One possible solution is to use replication throttling (KIP-73):
https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas
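As a rough sketch (not something I have run against your cluster), the Python below only builds the kafka-configs.sh command lines for a follower-side throttle on the recovering broker and prints them; it does not execute anything. The config names follower.replication.throttled.rate and follower.replication.throttled.replicas come from KIP-73. The broker id, topic name, partition list, rate, and bootstrap address are made-up placeholders, and the helper names are hypothetical. I am also assuming a release where kafka-configs.sh accepts --bootstrap-server for these entities; older releases used --zookeeper instead.

```python
import shlex
from typing import Dict, Iterable, List

KAFKA_CONFIGS = "kafka-configs.sh"  # assumption: Kafka's bin/ directory is on PATH


def _alter(bootstrap: str, entity_type: str, entity_name: str,
           action: str, value: str) -> List[str]:
    """Build one kafka-configs.sh --alter invocation as an argv list."""
    return [KAFKA_CONFIGS, "--bootstrap-server", bootstrap,
            "--entity-type", entity_type, "--entity-name", entity_name,
            "--alter", action, value]


def throttle_commands(bootstrap: str, broker_id: int, rate_bytes_per_sec: int,
                      partitions_by_topic: Dict[str, Iterable[int]]) -> List[List[str]]:
    """Commands that cap inbound replication on the recovering broker."""
    # Broker-level rate: upper bound on throttled follower traffic, in bytes/sec.
    cmds = [_alter(bootstrap, "brokers", str(broker_id), "--add-config",
                   f"follower.replication.throttled.rate={rate_bytes_per_sec}")]
    # Topic-level list: which replicas the rate applies to, as partition:broker pairs.
    for topic, partitions in sorted(partitions_by_topic.items()):
        replicas = ",".join(f"{p}:{broker_id}" for p in sorted(partitions))
        cmds.append(_alter(bootstrap, "topics", topic, "--add-config",
                           f"follower.replication.throttled.replicas=[{replicas}]"))
    return cmds


def unthrottle_commands(bootstrap: str, broker_id: int,
                        topics: Iterable[str]) -> List[List[str]]:
    """Commands that remove the throttle once the broker has caught up."""
    cmds = [_alter(bootstrap, "brokers", str(broker_id), "--delete-config",
                   "follower.replication.throttled.rate")]
    for topic in sorted(topics):
        cmds.append(_alter(bootstrap, "topics", topic, "--delete-config",
                           "follower.replication.throttled.replicas"))
    return cmds


if __name__ == "__main__":
    # Placeholder values: broker 3 is recovering, limited to 10 MiB/s.
    layout = {"orders": [0, 2]}
    for cmd in throttle_commands("localhost:9092", 3, 10 * 1024 * 1024, layout):
        print(shlex.join(cmd))
    for cmd in unthrottle_commands("localhost:9092", 3, layout):
        print(shlex.join(cmd))
```

The throttle has to be removed after the broker is back in sync, which is why the sketch also builds the matching delete commands. KIP-73 defines leader-side counterparts as well (leader.replication.throttled.rate and leader.replication.throttled.replicas) if you would rather limit the traffic at the sending brokers.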
You can set a lower replication quota for traffic from the other brokers to the failed broker while it is catching up, so that replication does not exhaust the storage throughput.

On Tue, Dec 15, 2020 at 20:56, Gokul Ramanan Subramanian <gokul24...@gmail.com> wrote:

> Hi.
>
> When a broker crashes and restarts after a long time (say 30 minutes),
> the broker has to do some work to sync all its replicas with the
> leaders. If the broker is a preferred replica for some partition, the
> broker may become the leader as part of a preferred replica leader
> election, while it is still catching up on some other partitions.
>
> This scenario can lead to a high incoming throughput on the broker
> during the catch up phase and cause back pressure with certain storage
> volumes (which have a fixed max throughput). This backpressure can
> slow down recovery time, and manifest in the form of client
> application errors in producing / consuming data on / from the
> recovering broker.
>
> I am looking for solutions to mitigate this problem. There are two
> solutions that I am aware of.
>
> 1. Scale out the cluster to have more brokers, so that the replication
> traffic is smaller per broker during recovery.
>
> 2. Keep preferred replica leader elections disabled and manually run
> one instance of preferred replica leader election after the broker has
> come back up and fully caught up.
>
> Are there other solutions?
>

--
========================
Okada Haruki
ocadar...@gmail.com
========================