Hi Jun,

Thanks for your comments!
JR1: You are correct that the segment rolling configurations are currently critical to balancing the scalability of Diskless and Tiered Storage: larger roll configurations benefit Tiered Storage, while smaller roll configurations benefit Diskless. To address your points specifically:

(1) A Diskless topic that is cost-competitive with an equivalent Classic topic will have a metadata size <1% of the data size. A cluster storing 360GB of metadata will therefore have >36TB of data under management, and a retention of 5hr implies a throughput of >2GB/s. This will require multiple Diskless Coordinators, which can share the load of storing the Diskless metadata and serving Diskless requests.

(2) Catching-up consumers are intended to be served from Tiered Storage and local segment caches. Brokers that are building their local segment caches will have to read many files, but will amortize those reads by receiving data for multiple partitions in a single read.

(3) This is a fundamental downside of storing data from multiple topics in a single object, similar to classic segments. We can implement a configurable cluster-wide maximum roll time, which would set the slowest cadence at which Tiered Storage segments are rolled from Diskless segments. If an individual partition has more aggressive roll settings, it may be rolled earlier. This configuration would permit the cluster operator to approximately bound the number of Diskless WAL segments, which in turn bounds the total size of the WAL segments, the disk cache, the Diskless Coordinator state, and the excess retention window. For example, a diskless.segment.ms of 15 minutes would reduce the metadata storage to 18GB and the WAL segments to 1.8TB, and would permit short-retention data to be physically deleted as soon as ~15 minutes after being produced. Of course, this will reduce the size of the Tiered Storage segments for topics that have low throughput and where segment.ms > diskless.segment.ms, increasing overhead in the RLMM.
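For reference, here is the arithmetic behind those figures spelled out (a quick Python sketch; the 2 WAL files/sec, 100 bytes per WAL file entry, and 100K partition figures come from your JR1 example):

```python
# Back-of-envelope check of the JR1 sizing figures. Assumptions (from Jun's
# example): each partition produces 2 WAL files per second, and each WAL
# file entry costs 100 bytes of coordinator metadata.

SECONDS_PER_HOUR = 3600

def coordinator_metadata_bytes(roll_hours, partitions,
                               wal_files_per_sec=2, bytes_per_entry=100):
    """Total coordinator metadata held between segment rolls."""
    per_partition = roll_hours * SECONDS_PER_HOUR * wal_files_per_sec * bytes_per_entry
    return per_partition * partitions

# 5-hour roll across 100K partitions -> ~360 GB of coordinator metadata.
print(coordinator_metadata_bytes(5, 100_000) / 1e9)      # 360.0

# A 15-minute diskless.segment.ms (0.25 h) shrinks this 20x -> ~18 GB.
print(coordinator_metadata_bytes(0.25, 100_000) / 1e9)   # 18.0

# Implied write throughput for 36 TB retained over 5 hours, in GB/s:
print(36e12 / (5 * SECONDS_PER_HOUR) / 1e9)              # 2.0
```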
We can perform merging/optimization of Tiered Storage segments to achieve the per-topic segment.ms. There were some reasons why we retracted the prior file-merging approach, and why merging in Tiered Storage appears better:

* Rewriting files requires mutability for existing data, which adds complexity. Diskless batches or Remote Log Segments would need to be made mutable, and the remote log will be made mutable in KIP-1272 [1].
* Because a WAL Segment can contain batches from multiple Diskless Coordinators, multiple coordinators must also be involved in the merging step. The Tiered Storage design has exclusive ownership for remote log segments within the RLMM.
* Diskless file merging competes for resources with latency-sensitive producers and hot consumers. Tiered Storage file merging competes for resources with lagging consumers, which are typically less latency-sensitive.
* Implementing merging in Tiered Storage allows this optimization to benefit both Classic topics and Diskless topics, covering both high- and low-throughput partitions.
* Remote log segments may be optimized over much longer time windows, rather than performing optimization once in the first few hours of the life of a WAL segment and then freezing the arrangement of the data until it is deleted.
* File merging will need to rely on heuristics, which should be configurable by the user. Multi-partition heuristics are more complicated to describe and reason about than single-partition heuristics.

What do you think of this alternative?

JR2: Yes, the current default partition assignment strategy will need some improvement. This problem with Diskless WAL segments is analogous to the Classic topics’ dense inter-broker connection graph. The natural solution to this seems to be some sort of cellular design, where the replica placements tend to locate partitions in similar groups. Partitions in the same cell can generally share the same WAL Segments and the same Diskless Coordinator requests.
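As a rough illustration of the cellular idea (purely hypothetical, not a proposal in this KIP; the `cell_size` and round-robin placement are placeholder choices):

```python
# Hypothetical sketch of cell-aware replica placement for Diskless topics
# (illustration only; not part of the KIP). Brokers are grouped into fixed
# cells and every replica of a partition stays inside one cell, so a WAL
# file written in that cell is fetched by at most cell_size brokers rather
# than by the whole cluster.

def assign_replicas(partitions, brokers, cell_size=6, replication_factor=3):
    cells = [brokers[i:i + cell_size] for i in range(0, len(brokers), cell_size)]
    assignment = {}
    for idx, partition in enumerate(partitions):
        cell = cells[idx % len(cells)]            # spread partitions across cells
        start = idx % len(cell)                   # rotate placement within a cell
        assignment[partition] = [cell[(start + r) % len(cell)]
                                 for r in range(replication_factor)]
    return assignment

brokers = list(range(12))                         # 12 brokers -> 2 cells of 6
placement = assign_replicas([f"t-{p}" for p in range(8)], brokers)
# e.g. placement["t-0"] == [0, 1, 2] and placement["t-1"] == [7, 8, 9]:
# every replica set is confined to a single cell.
```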
This would also benefit Classic topics, which would need fewer connections and fetch requests. Such a feature is out of scope for this KIP; either we will publish a follow-up KIP, or let operators and community tooling address this.

JR3: Yes, we will replace the ISR/ELR election logic for diskless topics, as they no longer rely on replicas for data integrity. We will fully model the state/lifecycle of the diskless replicas in KRaft, and choose how we display this to clients. For backwards compatibility, clients using older metadata requests should see diskless topics but interpret them as classic topics. We could tell older clients that the leader is in the ISR, even if it just started building its cache. Clients using the latest metadata should see the true state of the diskless partition: which nodes can accept produce/fetch/sharefetch requests, which ranges of offsets are cached on-broker, etc. This could also be used to break apart the “leader” field into more granular fields, now that leadership has changed meaning.

JR4: Yes, we can replace the empty fetch requests to the leader nodes with cache hint fields in the requests to the Diskless Coordinator, and rely on the coordinator to distribute cache hints to all replicas. This should be low-overhead, and would eliminate the inter-broker communication for brokers which only host Diskless topics.

JR5.1: You are correct, and this text was ambiguous, only specifying that the controller waits for the sync to complete. The section is now updated to explicitly say that local segments are built from object storage.

JR5.2: Extending the JR2 discussion, reassignment of diskless topics would generally happen within a cell, where the marginal cost of reading an additional partition is very low. When cells are re-balanced and a partition is migrated between cells, there is a brief period (until the next Tiered Storage segment roll) when the marginal cost is doubled.
This should be infrequent and well-amortized by other topics which aren’t being re-balanced between cells.

JR6.1: We plan to move data from Diskless to Tiered Storage. Once the data is in Tiered Storage, it can be compacted using the functionality described in KIP-1272 [1].

JR6.2: We will add details for this soon.

JR7: We specify the requirement of eventual consistency to allow Diskless Topics to be used with object storage implementations beyond the three major public clouds, such as self-managed software or caches with weaker consistency.

Thanks,
Greg

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage

On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <[email protected]> wrote:

> Hi, Ivan,
>
> Thanks for the KIP. A few comments below.
>
> JR1. I am concerned about the usage of the current tiered storage to
> control the number of small WAL files. Current tiered storage only tiers
> the data when a segment rolls, which can take hours. This causes three
> problems. (1) Much more metadata needs to be stored and maintained, which
> increases the cost. Suppose that each segment rolls every 5 hours, each
> partition generates 2 WAL files per second and each WAL file's metadata
> takes 100 bytes. Each partition will generate 5 * 3.6K * 2 * 100 = 3.6MB of
> metadata. In a cluster with 100K partitions, this translates to 360GB of
> metadata stored on the diskless coordinators. (2) A catching-up consumer's
> performance degrades since it's forced to read data from many small WAL
> files. (3) The data in WAL files could be retained much longer than
> retention time. Since the small WAL files aren't completely deleted until
> all partitions' data in it are obsolete, the deletion of the WAL files
> could be delayed by hours or more. If the WAL file includes a partition
> with a low retention time, the retention contract could be violated
> significantly.
> The earlier design of the KIP included a separate object
> merging process that combines small WAL files much more aggressively than
> tiered storage, which seems to be a much better choice.
>
> JR2. I don't think the current default partition assignment strategy for
> classic topics works for diskless topics. Current strategy tries to spread
> the replicas to as many brokers as possible. For example, if a broker has
> 100 partitions, their replicas could be spread over 100 brokers. If the
> broker generates a WAL file with 100 partitions, this WAL file will be read
> 100 times, once by each broker. S3 read cost is 1/12 of the cost of S3 put.
> This assignment strategy will increase the S3 cost by about 8X, which is
> prohibitive. We need to design a cost effective assignment strategy for
> diskless topics.
>
> JR3. We need to think through the leade election logic with diskless topic.
> The KIP tries to reuse the ISR logic for class topic, but it doesn't seem
> very natural.
> JR3.1 In classsic topic, the leader is always in ISR. In the diskless
> topic, the KIP says that a leader could be out of sync.
> JR3.2 The existing leader election logic based on ISR/ELR mainly retries to
> preserve previously acknowledged data. With diskless topics, since the
> object store provides durability, this logic seems no longer needed. The
> existing min.isr and unclean leader election logic also don't apply.
>
> JR4. "Despite that there is no inter-broker replication, replicas will
> still issue FetchRequest to leaders. Leaders will respond with empty (no
> records) FetchResponse."
> This seems unnatural. Could we avoid issuing inter broker fetch requests
> for diskless topics?
>
> JR5. "The replica reassignment will follow the same flow as in classic
> topic:".
> JR5.1 Is this true? Since inter broker fetch response is alway empty, it
> doesn't seem the current reassignment flow works for diskless topic. Also,
> since the source of the data is object store, it seems more natural for a
> replica to back fill the data from the object store, instead of other
> replicas. This will also incur lower costs.
> JR5.2 How do we prevent reassignment on diskless topics from causing the
> same cost issue described in JR2?
>
> JR6. "In other functional aspects, diskless topics are indistinguishable
> from classic topics. This includes durability guarantees, ordering
> guarantees, transactional and non-transactional producer API, consumer API,
> consumer groups, share groups, data retention (deletion & compact),"
> JR6.1 Could you describe how compact diskless topics are supported?
> JR6.2 Neither this KIP nor KIP-1164 describes the transactional support in
> detail.
>
> JR7. "Object Storage: A shared, durable, concurrent, and eventually
> consistent storage supporting arbitrary sized byte values and a minimal set
> of atomic operations: put, delete, list, and ranged get."
> It seems that the object storage in all three major public clouds are
> strongly consistent.
>
> Jun
>
> On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]> wrote:
>
> > Hi all,
> >
> > The parent KIP-1150 was voted for and accepted. Let's now focus on the
> > technical details presented in this KIP-1163 and also in KIP-1164:
> > Diskless Coordinator [1].
> >
> > Best,
> > Ivan
> >
> > [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
> >
> > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
> > > Hi all!
> > >
> > > We want to start the discussion thread for KIP-1163: Diskless Core [1],
> > > which is a sub-KIP for KIP-1150 [2].
> > >
> > > Let's use the main KIP-1150 discuss thread [3] for high-level questions,
> > > motivation, and general direction of the feature and this thread for
> > > particular details of implementation.
> > >
> > > Best,
> > > Ivan
> > >
> > > [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> > > [2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > [3] https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
