Sounds good, and thanks for filing the ticket and for the KIP in general. We're looking forward to the leader election improvements that will come with the remainder!
On Mon, Dec 15, 2025 at 6:59 PM Calvin Liu via dev <[email protected]> wrote:

> Hi Martin,
> Thanks for the report, this flapping behaviour sounds like something we
> should address. I opened a ticket to track the progress:
> https://issues.apache.org/jira/browse/KAFKA-19996.
> Thanks again!
>
> On Fri, Nov 28, 2025 at 2:04 AM Martin Dickson via dev <[email protected]> wrote:
>
> > Hi all,
> >
> > I wanted to check the expected behaviour of this KIP in the following case.
> > Assume you have a partition of a topic with minISR=2, RF=3. The leader is
> > receiving a lot of traffic on that partition, so much that the followers
> > can't keep up. One follower is booted out of the ISR as normal, but
> > currently the second follower flaps in and out of the ISR. This is because
> > ISR expand / shrink perform different checks:
> >
> > - Shrink ISR, attempted on a schedule, triggers if the follower's LEO has
> > not recently caught up with the leader's LEO (see `isCaughtUp(...)` in
> > `Replica.scala`)
> > - Expand ISR, attempted while processing a follower fetch, triggers if the
> > follower's LEO has caught up with the leader's HWM in this fetch (via
> > `isFollowerInSync(...)` in `Partition.scala`)
> >
> > and both of these will always trigger: shrink because the follower's LEO
> > will be way behind (it has far less data), but also expand because in this
> > case the leader's HWM is equal to the follower's LEO. Expand triggering is
> > new behaviour with KIP-966: the `isFollowerInSync(...)` code itself is
> > unchanged, but the behaviour is different with the change in HWM
> > advancement logic. This causes the flapping, which means load on the
> > control plane (for each such partition we trigger two requests in every
> > iteration of the "isr-expiration" schedule).
> >
> > We noticed this with a producer using acks=1 on a minISR=2 topic (which,
> > FWIW, we consider a misconfiguration). Due to acks=1 we are able to pile
> > up a large amount of data on the leader. The HWM is advancing slowly
> > (dictated by the follower's speed), but arguably the HWM should not be
> > advancing at all because the follower is lagging significantly. With
> > acks=all I believe the behaviour would be different: we'd backpressure the
> > producer (either via the replication speed or via the partition becoming
> > read-only) and wouldn't be able to accumulate so much unreplicated data on
> > the leader.
> >
> > An alternative would be to change the implementation of "expand" to also
> > use the leader's LEO. This would be more in line with the
> > replica.lag.max.time.ms documentation. I can imagine we don't want to
> > over-index on the behaviour of acks=1/minISR=2 examples, but on the other
> > hand I'm not immediately seeing why shrink and expand should perform
> > different checks in general.
> >
> > Thanks,
> > Martin
> >
> > On 2023/08/10 22:46:52 Calvin Liu wrote:
> > > Hi everyone,
> > > I'd like to discuss a series of enhancements to the replication protocol.
> > >
> > > A partition replica can experience local data loss in unclean shutdown
> > > scenarios where unflushed data in the OS page cache is lost - such as an
> > > availability zone power outage or a server error. The Kafka replication
> > > protocol is designed to handle these situations by removing such
> > > replicas from the ISR and only re-adding them once they have caught up
> > > and therefore recovered any lost data. This prevents replicas that lost
> > > an arbitrary log suffix, which included committed data, from being
> > > elected leader.
> > > However, there is a "last replica standing" state which, when combined
> > > with a data loss unclean shutdown event, can turn a local data loss
> > > scenario into a global data loss scenario, i.e., committed data can be
> > > removed from all replicas. When the last replica in the ISR experiences
> > > an unclean shutdown and loses committed data, it will be re-elected
> > > leader after starting up again, causing rejoining followers to truncate
> > > their logs and thereby removing the last copies of the committed records
> > > which the leader lost initially.
> > >
> > > The new KIP will maximize the protection and provide MinISR-1 tolerance
> > > to data loss unclean shutdown events.
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
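For readers following the shrink/expand asymmetry discussed in this thread, here is a minimal sketch of the two checks as characterized in Martin's report. The names (`FollowerState`, `shouldShrink`, `shouldExpand`) are illustrative stand-ins, not the actual code in `Replica.scala` or `Partition.scala`; the sketch only assumes that shrink compares the follower's LEO against the leader's LEO over a `replica.lag.max.time.ms` window, while expand compares the follower's LEO against the leader's HWM at fetch time.

```scala
// Illustrative sketch only -- not the actual Kafka implementation in
// Replica.scala / Partition.scala. It captures the asymmetry described
// above: shrink is judged against the leader's LEO over time, while
// expand is judged against the leader's HWM at fetch time.

final case class FollowerState(
  logEndOffset: Long,       // follower LEO, as last reported to the leader
  lastCaughtUpTimeMs: Long  // last time the follower's LEO matched the leader's LEO
)

object IsrChecksSketch {

  // Shrink check (run on the "isr-expiration" schedule): a follower that has
  // not matched the leader's LEO within replica.lag.max.time.ms is removed.
  def shouldShrink(follower: FollowerState,
                   nowMs: Long,
                   replicaLagMaxMs: Long): Boolean =
    nowMs - follower.lastCaughtUpTimeMs > replicaLagMaxMs

  // Expand check (run when processing a follower fetch): a follower whose LEO
  // has reached the leader's HWM is added back. When the lagging follower is
  // the only one holding the HWM back, its LEO equals the HWM by definition,
  // so this passes even while the shrink check keeps firing -- hence the
  // flapping described in the report.
  def shouldExpand(followerLeo: Long, leaderHighWatermark: Long): Boolean =
    followerLeo >= leaderHighWatermark
}
```

Under the alternative Martin suggests, the expand check would also be judged against the leader's LEO (with the same lag window), making the two checks symmetric.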
