Hi Manan,

Thanks for the feedback & discussion on this.

We think a time/bytes-based approach will work best here. The issue
with the offset-based approach is that it doesn't scale well across
different topic throughput rates. For high-throughput topics, the
offset gap between existing and newly bootstrapped replicas can be
millions of messages, and convergence time approaches
local.retention.ms (potentially hours), since the gap only closes as
local retention truncates the old replica's segments until its offsets
fall within the range acceptable for leader election relative to the
newer replica. That means hours of rack imbalance even when all
consumers are caught up, which is a worse operational state than the
brief remote fetches this feature tries to prevent.

A time/bytes threshold bounds this directly: operators set a value
(e.g., 10 minutes or 100 MB), and once a replica's local data spans
that threshold (in duration or size), it becomes eligible for
leadership. This is intuitive, throughput-independent, and broadly
applicable across varied workloads.

On operator confusion with local.retention.ms: we'd name these
distinctly under a leader election namespace, e.g.:

- leader.election.eligible.local.log.ms (default: -1, disabled)
- leader.election.eligible.local.log.bytes (default: -1, disabled)

These clearly describe leader election eligibility, not data retention
policy. Setting either > 0 enables the feature. No separate boolean is
needed, which simplifies the configuration compared to the original
proposal.
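To make the proposed semantics concrete, here is a minimal sketch of the eligibility check. All names (the class, fields, and methods) are illustrative only, not the actual KIP implementation, and it assumes a replica becomes eligible when either threshold is met, mirroring the retention.ms/retention.bytes convention:

```java
// Hypothetical sketch; names and either-threshold semantics are assumptions.
public final class LeaderEligibility {

    // Mirrors the proposed configs; -1 disables the corresponding threshold.
    private final long eligibleLocalLogMs;     // leader.election.eligible.local.log.ms
    private final long eligibleLocalLogBytes;  // leader.election.eligible.local.log.bytes

    public LeaderEligibility(long eligibleLocalLogMs, long eligibleLocalLogBytes) {
        this.eligibleLocalLogMs = eligibleLocalLogMs;
        this.eligibleLocalLogBytes = eligibleLocalLogBytes;
    }

    // Setting either config > 0 enables the feature; no separate boolean.
    public boolean featureEnabled() {
        return eligibleLocalLogMs > 0 || eligibleLocalLogBytes > 0;
    }

    // A replica is eligible once its local log spans either threshold.
    public boolean isEligible(long localLogSpanMs, long localLogSizeBytes) {
        if (!featureEnabled()) {
            return true; // feature disabled: all replicas eligible, as today
        }
        boolean timeOk = eligibleLocalLogMs > 0
                && localLogSpanMs >= eligibleLocalLogMs;
        boolean bytesOk = eligibleLocalLogBytes > 0
                && localLogSizeBytes >= eligibleLocalLogBytes;
        return timeOk || bytesOk;
    }
}
```

The stable sort from the KIP would then only push replicas failing this check below eligible ones, leaving the original assignment order intact otherwise.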

We are considering updating the KIP to reflect this approach.
Interested to hear your thoughts.

Thanks,
Tom

On Mon, May 11, 2026 at 2:07 PM Thomas Thornton
<[email protected]> wrote:
>
> Hi Manan,
>
> Thanks for the detailed discussion on this.
>
> Yes, this is a temporary and expected trade-off.
>
> The skew self-resolves: once the new replicas close the offset
> threshold gap, the stable sort treats them as equivalent and the
> original assignment order is restored. The auto leader rebalancer
> (runs every 5 minutes by default) then moves leadership back to the
> rack-balanced preferred replica automatically. Convergence occurs once
> the threshold is met (default 10k messages). In the case of high topic
> throughput, the gap between existing and new replicas can be very
> large (millions of messages), and convergence time may approach
> local.retention.ms.
>
> One alternative design is that the leader preference is determined
> based on a time threshold of local data. For example, it could be
> configured to 10 minutes. This means that replicas whose local data
> spans more than 10 minutes are preferred. The benefit here is that it
> is decoupled from kafka topic throughput rate and allows operators to
> easily set the maximum time a thin local data broker is de-prioritized
> (and thus the max time some rack imbalance may be present). This
> behaves well for consumers/followers that mostly keep up with
> real-time, and thus would not trigger any unnecessary remote fetches.
> We would consider updating the KIP to this approach if we agree it
> provides better utility. Similar to other properties (e.g.,
> retention.ms / retention.bytes), we'd propose this config as a
> time/bytes pair. Interested to hear your thoughts on this.
>
> If operators prefer maintaining rack balance over avoiding remote
> fetches, they simply don't enable the feature for the cluster, or
> disable it per-topic via the topic-level override.
>
> Thanks,
> Tom
>
> On Wed, Apr 29, 2026 at 8:48 AM Thomas Thornton
> <[email protected]> wrote:
> >
> > Hi Manan,
> >
> > Thanks for the discussion.
> >
> > MG1: Good point. I've updated the KIP to add topic-level overrides for
> > both configs so operators can tune the threshold per topic if the
> > cluster-wide default doesn't fit a particular workload.
> >
> > MG2: This shouldn't cause skew. The localLogStartOffset sorting only
> > pushes down newly-bootstrapped replicas that have significantly less
> > local data. For existing replicas with comparable local data, they'll
> > be within the threshold and fall back to the original assignment
> > order, same behavior as today. We're not introducing a new dimension
> > that would systematically favor certain racks or brokers.
> >
> > MG3: When tiered storage is not enabled for a topic, we will not send
> > AlterPartition requests to report localLogStartOffset. There will be
> > no extra control-plane overhead for clusters or topics that don't use
> > tiered storage. When enabled, the additional AlterPartition calls only
> > fire when an ISR member's localLogStartOffset actually changes, which
> > reuses the existing protocol and should be infrequent.
> >
> > Thanks,
> > Tom
> >
> > On Wed, Apr 29, 2026 at 5:42 PM Thomas Thornton
> > <[email protected]> wrote:
> > >
> > > Hi Ivan,
> > >
> > > Thanks for the feedback.
> > >
> > > IY1: Yes, the sort is stable. Replicas within the threshold are
> > > considered equivalent and retain their original assignment order.
> > > We're only reordering replicas that have significantly less local
> > > data, the existing replicas keep the same relative ordering as before.
> > > Updated that part of the KIP to reflect this.
> > >
> > > IY2: Good idea. I've updated the KIP to add topic-level overrides for
> > > both configs. This follows the standard Kafka pattern (like
> > > `retention.ms`, `log.retention.ms`). The cluster-wide default applies
> > > unless overridden per topic.
> > >
> > > Thanks,
> > > Tom
> > >
> > > On Fri, Apr 24, 2026 at 3:18 PM Ivan Yurchenko <[email protected]> wrote:
> > > >
> > > > Hi Thomas,
> > > >
> > > > Thank you for the KIP. The motivation makes sense to me. I have a 
> > > > couple of comments:
> > > >
> > > > IY1:
> > > > > When `leader.election.prefer.early.local.log.start.offset is 
> > > > > enabled`, the key change is to sort targetReplicas by 
> > > > > local-log-start-offset (ascending) before selecting a leader. This 
> > > > > ensures replicas with more local data (lower local-log-start-offset) 
> > > > > are considered first in both election paths.
> > > >
> > > > I assume here it meant to say "sort stably", to preserve the original 
> > > > preference order as much as possible?
> > > >
> > > > IY2:
> > > > Can we find a reason for a particular topic to not follow the new 
> > > > leader election algorithm, or it is strictly better and once enabled 
> > > > it's not expected to be disabled? If the answer is yes, would you 
> > > > consider adding the topic-level versions of the new configs 
> > > > leader.election.prefer.early.local.log.start.offset and 
> > > > leader.election.local.log.start.offset.threshold?
> > > >
> > > > Best,
> > > > Ivan
> > > >
> > > >
> > > > On Mon, Mar 30, 2026, at 20:43, Thomas Thornton via dev wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We want to start a discussion thread for KIP-1303: Deprioritize Tiered
> > > > Storage Followers In Leader Election.
> > > >
> > > > The adopted KIP-1023 introduced an optimization allowing followers to
> > > > skip replicating data already in remote storage, dramatically reducing
> > > > ISR join time. However, as noted in KIP-1023, this creates a risk: if
> > > > such a follower becomes leader, it may need to serve consumer requests
> > > > from remote storage, impacting performance.
> > > >
> > > > This KIP proposes to mitigate this risk by preferring replicas with
> > > > more local data (lower localLogStartOffset) during leader election.
> > > > Key changes include:
> > > > 1) New config leader.election.prefer.early.local.log.start.offset to
> > > > enable the feature
> > > > 2) New config leader.election.local.log.start.offset.threshold to
> > > > avoid leader churn from minor retention timing differences
> > > > 3) Extending FetchRequest and AlterPartition to propagate
> > > > localLogStartOffset from followers → leader → controller
> > > >
> > > > The full KIP is available here:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1303%3A+Deprioritize+Tiered+Storage+Followers+In+Leader+Election
> > > >
> > > > Thanks,
> > > > Tom
> > > >
> > > >
