Hey Martin. I recommend you take a look at KIP-966. I think it can help with the use case you are describing. The KIP talks about failure scenarios, but I believe it will also help when the leader has issues and kicks its followers out of the ISR. The goal is to better handle the "last replica standing" issue.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas

Let me know if it helps,
Justine

On Tue, Sep 10, 2024 at 9:00 AM Martin Dickson
<martin.dick...@datadoghq.com.invalid> wrote:

> Hi all,
>
> We have a recurring issue with single broker failures causing offline
> partitions. The issue is that when a leader is degraded, follower fetches
> can fail to happen in a timely manner, and all followers can fall out of
> sync. If that leader later fails, the partition will go offline; but even
> if it remains only partially failed, applications might still be impacted
> (for example, if the producer is using acks=all and min.insync.replicas=2).
> This can all happen because of a problem solely with the leader, so a
> single broker failure can cause unavailability even with RF=3 or higher.
>
> We've seen the issue with various kinds of failures, mostly related to
> failing disks, e.g. due to pressure on request handler threads as a result
> of produce requests waiting on a slow disk. But the easiest way for us to
> reproduce it is at the outgoing network level: setting up a cluster with a
> moderate level of incoming throughput and then injecting 50% outgoing
> packet drop on a single broker is enough to make follower fetch requests
> slow and replication lag, but not enough for that broker to lose its
> connection to ZK. This leaves the degraded broker as the only member of
> the ISR.
>
> We have replica.lag.time.max.ms=10000 and zookeeper.session.timeout.ms=6000
> (the pre-KIP-537 values, 1/3 of the current defaults, to control produce
> latency when a follower is failing). We are also able to reproduce the
> issue in the same way on a KRaft cluster with the KRaft defaults. (Note
> that we are not very experienced with operating KRaft as we aren't running
> it in production yet.)
>
> The last KIP I saw regarding this was KIP-501
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-501+Avoid+out-of-sync+or+offline+partitions+when+follower+fetch+requests+are+not+processed+in+time>,
> which describes this exact problem. The proposed solution there was,
> first, to introduce a notion of pending requests and, second, to
> relinquish leadership if pending requests take too long. The discussion
> thread <https://lists.apache.org/thread/1kbs68dq60p31frpfsr3x1vcqlzjf60x>
> for that did not reach a conclusion. However, it was pointed out that the
> pending-requests approach would not cover all failure modes, and that
> whilst relinquishing leadership seems ideal, there are concerns about
> thrashing in certain failure modes.
>
> We are experimenting with a variation on KIP-501 where we add a heuristic
> for brokers failing this way: if the leader for a partition has removed
> many followers from ISR in a short period of time (including the case
> where it sends a single AlterPartition request removing all followers and
> thereby shrinking the ISR to just itself), have the controller ignore the
> request and instead choose one of the followers to become leader. To avoid
> thrashing, rate-limit how often the controller can do this per
> topic-partition. We have tested that this fixes our repro, but have not
> productionised it (see rough draft PR
> <https://github.com/DataDog/kafka/pull/15/files>). We have only
> implemented ZK mode so far.
> We implemented this on the controller side out of convenience (no API
> changes), but the demotion decision could potentially be taken at the
> broker level instead, which should also be possible.
>
> Whilst the code change is small, the proposed solution we're investigating
> isn't very clean, and we're not totally satisfied with it. We wanted to get
> some ideas from the community on:
> 1. How are other folks handling this class of issues?
> 2. Is there any interest in adding more comprehensive failing-broker
> detection to Kafka (particularly how this could look in KRaft)?
> 3. Is there any interest in a heuristic failure detection like the one
> described above?
>
> Thanks,
> Martin
>
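(For illustration only, a minimal self-contained sketch of the controller-side heuristic Martin describes above. The class and method names, thresholds, and cooldown bookkeeping here are hypothetical and are not taken from the linked draft PR or from the actual Kafka controller code.)

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/**
 * Sketch of the heuristic: if a leader removes "too many" followers from the
 * ISR within a short window (or shrinks the ISR to just itself in one
 * request), ignore the shrink and elect one of the followers instead,
 * rate-limited per topic-partition to avoid leadership thrashing.
 */
public class IsrShrinkGuard {

    private final int removalThreshold;    // removals per window before the leader is suspected
    private final long windowMs;           // sliding window for counting removals
    private final long overrideCooldownMs; // minimum gap between demotions per partition

    // Timestamps of follower removals requested by the leader, per partition.
    private final Map<String, Deque<Long>> recentRemovals = new HashMap<>();
    // Last time a shrink was overridden (leader demoted), per partition.
    private final Map<String, Long> lastOverrideMs = new HashMap<>();

    public IsrShrinkGuard(int removalThreshold, long windowMs, long overrideCooldownMs) {
        this.removalThreshold = removalThreshold;
        this.windowMs = windowMs;
        this.overrideCooldownMs = overrideCooldownMs;
    }

    /**
     * Decide what to do with an ISR shrink requested by the current leader.
     * Returns the follower to elect as the new leader if the shrink should be
     * ignored, or empty if the shrink should be applied as requested.
     */
    public Optional<Integer> maybeOverrideShrink(String topicPartition,
                                                 int leaderId,
                                                 List<Integer> currentIsr,
                                                 List<Integer> proposedIsr,
                                                 long nowMs) {
        int removedNow = currentIsr.size() - proposedIsr.size();
        boolean shrinksToLeaderOnly =
                proposedIsr.size() == 1 && proposedIsr.contains(leaderId);

        // Track how many followers this leader has removed within the window.
        Deque<Long> removals =
                recentRemovals.computeIfAbsent(topicPartition, k -> new ArrayDeque<>());
        while (!removals.isEmpty() && nowMs - removals.peekFirst() > windowMs) {
            removals.pollFirst();
        }
        for (int i = 0; i < removedNow; i++) {
            removals.addLast(nowMs);
        }

        boolean leaderLooksDegraded =
                shrinksToLeaderOnly || removals.size() >= removalThreshold;
        if (!leaderLooksDegraded) {
            return Optional.empty(); // Ordinary shrink: apply it as requested.
        }

        // Rate-limit demotions per topic-partition to avoid thrashing.
        Long lastOverride = lastOverrideMs.get(topicPartition);
        if (lastOverride != null && nowMs - lastOverride < overrideCooldownMs) {
            return Optional.empty(); // Demoted recently: let the shrink through.
        }

        // Ignore the shrink and pick one of the still-in-sync followers as leader.
        Optional<Integer> newLeader = currentIsr.stream()
                .filter(replica -> replica != leaderId)
                .findFirst();
        newLeader.ifPresent(r -> lastOverrideMs.put(topicPartition, nowMs));
        return newLeader;
    }
}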