One possible solution is to use a replication throttle.

You can set a lower replication quota for traffic from the other brokers
to the failed broker while it is catching up, so that replication does
not exhaust the storage throughput.
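
As a rough illustration of how such a quota could be sized, here is a
minimal Python sketch. All numbers, the broker id, and the bootstrap
address are hypothetical. It assumes the throttle is applied through the
broker-level `follower.replication.throttled.rate` config with
`kafka-configs.sh`, and that the topic-level
`follower.replication.throttled.replicas` config is also set so the rate
applies to the replicas in question. Treat it as a sketch under those
assumptions, not a tuned recommendation.

```python
# Sketch: size a follower replication throttle for a recovering broker.
# Every number below is a made-up example, not a measured value.

def follower_throttle_bps(volume_max_bps, client_write_bps, headroom_pct=20):
    """Bytes/sec left for catch-up replication after reserving a safety
    margin on the volume and subtracting client write traffic."""
    budget = volume_max_bps * (100 - headroom_pct) // 100 - client_write_bps
    if budget <= 0:
        raise ValueError("no throughput left for catch-up replication")
    return budget

def catch_up_seconds(backlog_bytes, throttle_bps, incoming_bps):
    """Estimated time to close the lag. The backlog only shrinks if the
    throttle is higher than the rate at which new data keeps arriving."""
    net = throttle_bps - incoming_bps
    if net <= 0:
        raise ValueError("throttle too low: the follower would never catch up")
    return backlog_bytes / net

def throttle_command(broker_id, rate_bps, bootstrap="localhost:9092"):
    """Build the kafka-configs.sh arguments that set the broker-level rate.
    The bootstrap address is a placeholder."""
    return [
        "bin/kafka-configs.sh", "--bootstrap-server", bootstrap,
        "--entity-type", "brokers", "--entity-name", str(broker_id),
        "--alter", "--add-config",
        f"follower.replication.throttled.rate={rate_bps}",
    ]

# Hypothetical volume: 250 MB/s max, 50 MB/s of client writes, 20% headroom.
rate = follower_throttle_bps(250_000_000, 50_000_000)
# Hypothetical backlog: 30 minutes of missed data at 50 MB/s.
eta = catch_up_seconds(90_000_000_000, rate, 50_000_000)
print(rate, eta)
print(" ".join(throttle_command(1, rate)))
```

The trade-off is visible in `catch_up_seconds`: a lower throttle protects
the volume but lengthens recovery, and a throttle at or below the incoming
write rate means the broker never rejoins the ISR. The throttle configs
should be removed once the broker has caught up.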

On Tue, Dec 15, 2020 at 20:56, Gokul Ramanan Subramanian <gokul24...@gmail.com> wrote:

> Hi.
> When a broker crashes and restarts after a long time (say 30 minutes),
> the broker has to do some work to sync all its replicas with the
> leaders. If the broker is a preferred replica for some partition, the
> broker may become the leader as part of a preferred replica leader
> election, while it is still catching up on some other partitions.
> This scenario can lead to a high incoming throughput on the broker
> during the catch-up phase and cause backpressure with certain storage
> volumes (which have a fixed max throughput). This backpressure can
> slow down recovery time, and manifest in the form of client
> application errors in producing / consuming data on / from the
> recovering broker.
> I am looking for solutions to mitigate this problem. There are two
> solutions that I am aware of.
> 1. Scale out the cluster to have more brokers, so that the replication
> traffic is smaller per broker during recovery.
> 2. Keep preferred replica leader elections disabled and manually run
> one instance of preferred replica leader election after the broker has
> come back up and fully caught up.
> Are there other solutions?

Okada Haruki
