Hi, Ivan,

Thanks for the KIP. A few comments below.

JR1. I am concerned about the usage of the current tiered storage to
control the number of small WAL files. Current tiered storage only tiers
the data when a segment rolls, which can take hours. This causes three
problems. (1) Much more metadata needs to be stored and maintained, which
increases the cost. Suppose that each segment rolls every 5 hours, each
partition generates 2 WAL files per second and each WAL file's metadata
takes 100 bytes. Each partition will generate 5 * 3.6K * 2 * 100 = 3.6MB of
metadata. In a cluster with 100K partitions, this translates to 360GB of
metadata stored on the diskless coordinators. (2) A catching-up consumer's
performance degrades since it's forced to read data from many small WAL
files. (3) The data in WAL files could be retained much longer than
retention time. Since a small WAL file isn't completely deleted until all
partitions' data in it is obsolete, the deletion of the WAL files
could be delayed by hours or more. If the WAL file includes a partition
with a low retention time, the retention contract could be violated
significantly. The earlier design of the KIP included a separate object
merging process that combines small WAL files much more aggressively than
tiered storage, which seems to be a much better choice.
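As a sanity check on the arithmetic in (1), here is a minimal sketch using
the assumptions stated above (5-hour segment roll, 2 WAL files per partition
per second, 100 bytes of metadata per WAL file, 100K partitions):

```python
# Metadata volume from JR1, computed from the stated assumptions.
roll_interval_s = 5 * 3600          # a segment rolls every 5 hours
wal_files_per_s = 2                 # WAL files per partition per second
metadata_bytes_per_file = 100       # metadata stored per WAL file

per_partition = roll_interval_s * wal_files_per_s * metadata_bytes_per_file
print(per_partition)                # 3600000 bytes = 3.6 MB per partition

cluster = per_partition * 100_000   # 100K partitions
print(cluster / 1e9)                # 360.0 GB on the diskless coordinators
```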

JR2. I don't think the current default partition assignment strategy for
classic topics works for diskless topics. The current strategy tries to
spread the replicas across as many brokers as possible. For example, if a
broker has
100 partitions, their replicas could be spread over 100 brokers. If the
broker generates a WAL file with 100 partitions, this WAL file will be read
100 times, once by each broker. S3 read cost is 1/12 of the cost of S3 put.
This assignment strategy will increase the S3 cost by about 8X, which is
prohibitive. We need to design a cost-effective assignment strategy for
diskless topics.
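The cost blow-up can be sketched with representative S3 Standard request
prices (assumption: roughly $0.005 per 1K PUTs and $0.0004 per 1K GETs,
i.e. a GET costs about 1/12 of a PUT; exact prices vary by region):

```python
# Rough request-cost model for JR2 under the assumed S3 prices.
put_cost = 0.005 / 1000    # $ per PUT request (assumed)
get_cost = 0.0004 / 1000   # $ per GET request (assumed)

# One WAL file containing 100 partitions, with each partition's replica
# on a different broker, is fetched once by every broker:
reads_per_file = 100
extra_read_cost = reads_per_file * get_cost
print(extra_read_cost / put_cost)  # ~8, so reads alone cost about 8x the PUT
```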

JR3. We need to think through the leader election logic for diskless topics.
The KIP tries to reuse the ISR logic for classic topics, but it doesn't seem
very natural.
JR3.1 In classic topics, the leader is always in the ISR. For diskless
topics, the KIP says that a leader could be out of sync.
JR3.2 The existing leader election logic based on ISR/ELR mainly tries to
preserve previously acknowledged data. With diskless topics, since the
object store provides durability, this logic seems no longer needed. The
existing min.insync.replicas and unclean leader election logic also don't
apply.

JR4. "Despite that there is no inter-broker replication, replicas will
still issue FetchRequest to leaders. Leaders will respond with empty (no
records) FetchResponse."
This seems unnatural. Could we avoid issuing inter-broker fetch requests
for diskless topics?

JR5. "The replica reassignment will follow the same flow as in classic
topic:".
JR5.1 Is this true? Since the inter-broker fetch response is always empty,
the current reassignment flow doesn't seem to work for diskless topics.
Also, since the source of the data is the object store, it seems more
natural for a replica to backfill the data from the object store instead of
from other replicas. This would also incur lower costs.
JR5.2 How do we prevent reassignment on diskless topics from causing the
same cost issue described in JR2?

JR6. "In other functional aspects, diskless topics are indistinguishable
from classic topics. This includes durability guarantees, ordering
guarantees, transactional and non-transactional producer API, consumer API,
consumer groups, share groups, data retention (deletion & compact),"
JR6.1 Could you describe how compaction is supported for diskless topics?
JR6.2 Neither this KIP nor KIP-1164 describes the transactional support in
detail.

JR7. "Object Storage: A shared, durable, concurrent, and eventually
consistent storage supporting arbitrary sized byte values and a minimal set
of atomic operations: put, delete, list, and ranged get."
It seems that the object storage services in all three major public clouds
are strongly consistent.

Jun

On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]> wrote:

> Hi all,
>
> The parent KIP-1150 was voted for and accepted. Let's now focus on the
> technical details presented in this KIP-1163 and also in KIP-1164: Diskless
> Coordinator  [1].
>
> Best,
> Ivan
>
> [1]
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
>
> On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
> > Hi all!
> >
> > We want to start the discussion thread for KIP-1163: Diskless Core [1],
> which is a sub-KIP for KIP-1150 [2].
> >
> > Let's use the main KIP-1150 discuss thread [3] for high-level questions,
> motivation, and general direction of the feature and this thread for
> particular details of implementation.
> >
> > Best,
> > Ivan
> >
> > [1]
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> > [2]
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > [3] https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
>
