Thanks. Makes sense.

On Wed, Dec 16, 2020 at 2:02 AM Haruki Okada <ocadar...@gmail.com> wrote:
> Hi.
> One possible solution I can imagine is to use a replication throttle:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas
>
> You can set a lower replication quota for the other-brokers -> failed-broker
> traffic while it is catching up, to prevent exhausting storage throughput.
>
> On Tue, Dec 15, 2020 at 20:56 Gokul Ramanan Subramanian <gokul24...@gmail.com> wrote:
>
> > Hi.
> >
> > When a broker crashes and restarts after a long time (say 30 minutes),
> > the broker has to do some work to sync all its replicas with the
> > leaders. If the broker is the preferred replica for some partition, it
> > may become the leader for that partition as part of a preferred replica
> > leader election while it is still catching up on other partitions.
> >
> > This scenario can lead to high incoming throughput on the broker
> > during the catch-up phase and cause backpressure with certain storage
> > volumes (which have a fixed maximum throughput). This backpressure can
> > slow down recovery and manifest as client application errors when
> > producing to or consuming from the recovering broker.
> >
> > I am looking for solutions to mitigate this problem. There are two
> > solutions that I am aware of.
> >
> > 1. Scale out the cluster to have more brokers, so that the replication
> > traffic per broker is smaller during recovery.
> >
> > 2. Keep automatic preferred replica leader elections disabled and manually
> > run one preferred replica leader election after the broker has come back
> > up and fully caught up.
> >
> > Are there other solutions?
>
> --
> ========================
> Okada Haruki
> ocadar...@gmail.com
> ========================
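
For reference, the throttle Haruki describes can be applied with kafka-configs.sh, roughly
as sketched below. The broker id (1, the recovering broker), the topic name (my-topic), the
~50 MB/s rate, and the bootstrap server are placeholders; exact flags depend on the Kafka
version (older releases may need --zookeeper instead of --bootstrap-server), and the throttle
can alternatively be applied on the leader side via leader.replication.throttled.rate on the
other brokers.

  # Cap the rate at which the recovering broker's replica fetchers pull data (bytes/sec)
  bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type brokers --entity-name 1 \
    --add-config follower.replication.throttled.rate=52428800

  # Mark which replicas the throttle applies to (here: all replicas of my-topic)
  bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type topics --entity-name my-topic \
    --add-config 'leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*'

  # Remove the throttle once the broker has caught up
  bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type brokers --entity-name 1 \
    --delete-config follower.replication.throttled.rate

Note that setting the rate too low can keep the replicas from ever rejoining the ISR, so the
throttle should be raised or removed once the broker has caught up.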
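Similarly, for option 2 in Gokul's list, automatic elections can be disabled with
auto.leader.rebalance.enable=false in server.properties, and a one-off preferred replica
leader election can then be triggered once the broker has fully caught up. On Kafka 2.4+
that would look roughly like the following (older versions use
kafka-preferred-replica-election.sh with --zookeeper instead):

  # Trigger a single preferred-replica leader election across all partitions
  bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
    --election-type PREFERRED --all-topic-partitions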