Hi Jun,

Thanks for your comments!
JR1: You are correct that the segment rolling configurations are currently critical to balancing the scalability of Diskless and Tiered Storage: larger roll configurations benefit Tiered Storage, while smaller roll configurations benefit Diskless. To address your points specifically:

(1) A Diskless topic that is cost-competitive with an equivalent Classic topic will have a metadata size <1% of the data size. A cluster storing 360GB of metadata will therefore have >36TB of data under management, and a retention of 5hr implies a throughput of >2GB/s. This will require multiple Diskless Coordinators, which can share the load of storing the Diskless metadata and serving Diskless requests.

(2) Catching-up consumers are intended to be served from Tiered Storage and local segment caches. Brokers that are building their local segment caches will have to read many files, but will amortize those reads by receiving data for multiple partitions in a single read.

(3) This is a fundamental downside of storing data from multiple topics in a single object, similar to classic segments. We can implement a configurable cluster-wide maximum roll time, which would set the slowest cadence at which Tiered Storage segments are rolled from Diskless segments. If an individual partition has more aggressive roll settings, it may be rolled earlier. This configuration would permit the cluster operator to approximately bound the number of Diskless WAL segments, which in turn bounds the total size of the WAL segments, the disk cache, the Diskless Coordinator state, and the excess retention window. For example, a diskless.segment.ms of 15 minutes would reduce the metadata storage to 18GB and the WAL segments to 1.8TB, and would permit short-retention data to be physically deleted as soon as ~15 minutes after being produced. Of course, this will reduce the size of the Tiered Storage segments for topics that have low throughput and where segment.ms > diskless.segment.ms, increasing overhead in the RLMM.
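For reference, here is the arithmetic behind those figures spelled out (a quick Python sketch; the 2 WAL files/sec, 100 bytes per WAL file entry, and 100K partition figures come from your JR1 example):

```python
# Back-of-envelope check of the JR1 sizing figures. Assumptions (from Jun's
# example): each partition produces 2 WAL files per second, and each WAL
# file entry costs 100 bytes of coordinator metadata.

SECONDS_PER_HOUR = 3600

def coordinator_metadata_bytes(roll_hours, partitions,
                               wal_files_per_sec=2, bytes_per_entry=100):
    """Total coordinator metadata held between segment rolls."""
    per_partition = roll_hours * SECONDS_PER_HOUR * wal_files_per_sec * bytes_per_entry
    return per_partition * partitions

# 5-hour roll across 100K partitions -> ~360 GB of coordinator metadata.
print(coordinator_metadata_bytes(5, 100_000) / 1e9)      # 360.0

# A 15-minute diskless.segment.ms (0.25 h) shrinks this 20x -> ~18 GB.
print(coordinator_metadata_bytes(0.25, 100_000) / 1e9)   # 18.0

# Implied write throughput for 36 TB retained over 5 hours, in GB/s:
print(36e12 / (5 * SECONDS_PER_HOUR) / 1e9)              # 2.0
```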
We can perform merging/optimization of Tiered Storage segments to achieve the per-topic segment.ms. There were some reasons why we retracted the prior file-merging approach, and why merging in Tiered Storage appears better:

* Rewriting files requires mutability for existing data, which adds complexity. Diskless batches or Remote Log Segments would need to be made mutable, and the remote log will be made mutable in KIP-1272 [1].
* Because a WAL Segment can contain batches from multiple Diskless Coordinators, multiple coordinators must also be involved in the merging step. The Tiered Storage design has exclusive ownership for remote log segments within the RLMM.
* Diskless file merging competes for resources with latency-sensitive producers and hot consumers. Tiered Storage file merging competes for resources with lagging consumers, which are typically less latency-sensitive.
* Implementing merging in Tiered Storage allows this optimization to benefit both Classic topics and Diskless topics, covering both high- and low-throughput partitions.
* Remote log segments may be optimized over much longer time windows, rather than performing optimization once in the first few hours of the life of a WAL segment and then freezing the arrangement of the data until it is deleted.
* File merging will need to rely on heuristics, which should be configurable by the user. Multi-partition heuristics are more complicated to describe and reason about than single-partition heuristics.

What do you think of this alternative?

JR2: Yes, the current default partition assignment strategy will need some improvement. This problem with Diskless WAL segments is analogous to the Classic topics’ dense inter-broker connection graph. The natural solution to this seems to be some sort of cellular design, where the replica placements tend to locate partitions in similar groups. Partitions in the same cell can generally share the same WAL Segments and the same Diskless Coordinator requests.
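As a rough illustration of the cellular idea (purely hypothetical, not a proposal in this KIP; the `cell_size` and round-robin placement are placeholder choices):

```python
# Hypothetical sketch of cell-aware replica placement for Diskless topics
# (illustration only; not part of the KIP). Brokers are grouped into fixed
# cells and every replica of a partition stays inside one cell, so a WAL
# file written in that cell is fetched by at most cell_size brokers rather
# than by the whole cluster.

def assign_replicas(partitions, brokers, cell_size=6, replication_factor=3):
    cells = [brokers[i:i + cell_size] for i in range(0, len(brokers), cell_size)]
    assignment = {}
    for idx, partition in enumerate(partitions):
        cell = cells[idx % len(cells)]            # spread partitions across cells
        start = idx % len(cell)                   # rotate placement within a cell
        assignment[partition] = [cell[(start + r) % len(cell)]
                                 for r in range(replication_factor)]
    return assignment

brokers = list(range(12))                         # 12 brokers -> 2 cells of 6
placement = assign_replicas([f"t-{p}" for p in range(8)], brokers)
# e.g. placement["t-0"] == [0, 1, 2] and placement["t-1"] == [7, 8, 9]:
# every replica set is confined to a single cell.
```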
This would also benefit Classic topics, which would need fewer connections and fetch requests. Such a feature is out of scope for this KIP; either we will publish a follow-up KIP, or let operators and community tooling address this.

JR3: Yes, we will replace the ISR/ELR election logic for diskless topics, as they no longer rely on replicas for data integrity. We will fully model the state/lifecycle of the diskless replicas in KRaft, and choose how we display this to clients. For backwards compatibility, clients using older metadata requests should see diskless topics but interpret them as classic topics. We could tell older clients that the leader is in the ISR, even if it just started building its cache. Clients using the latest metadata should see the true state of the diskless partition: which nodes can accept produce/fetch/sharefetch requests, which ranges of offsets are cached on-broker, etc. This could also be used to break apart the “leader” field into more granular fields, now that leadership has changed meaning.

JR4: Yes, we can replace the empty fetch requests to the leader nodes with cache hint fields in the requests to the Diskless Coordinator, and rely on the coordinator to distribute cache hints to all replicas. This should be low-overhead, and would eliminate the inter-broker communication for brokers which only host Diskless topics.

JR5.1: You are correct, and this text was ambiguous, only specifying that the controller waits for the sync to complete. The section is now updated to explicitly say that local segments are built from object storage.

JR5.2: Extending the JR2 discussion, reassignment of diskless topics would generally happen within a cell, where the marginal cost of reading an additional partition is very low. When cells are re-balanced and a partition is migrated between cells, there is a brief period (until the next Tiered Storage segment roll) when the marginal cost is doubled.
This should be infrequent and well-amortized by other topics which aren’t being re-balanced between cells.

JR6.1: We plan to move data from Diskless to Tiered Storage. Once the data is in Tiered Storage, it can be compacted using the functionality described in KIP-1272 [1].

JR6.2: We will add details for this soon.

JR7: We specify the requirement of eventual consistency to allow Diskless Topics to be used with object storage implementations beyond the three major public clouds, such as self-managed software or caches with weaker consistency.

Thanks,
Greg

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage

On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <[email protected]> wrote:

> Hi, Ivan,
>
> Thanks for the KIP. A few comments below.
>
> JR1. I am concerned about the usage of the current tiered storage to
> control the number of small WAL files. Current tiered storage only tiers
> the data when a segment rolls, which can take hours. This causes three
> problems. (1) Much more metadata needs to be stored and maintained, which
> increases the cost. Suppose that each segment rolls every 5 hours, each
> partition generates 2 WAL files per second and each WAL file's metadata
> takes 100 bytes. Each partition will generate 5 * 3.6K * 2 * 100 = 3.6MB of
> metadata. In a cluster with 100K partitions, this translates to 360GB of
> metadata stored on the diskless coordinators. (2) A catching-up consumer's
> performance degrades since it's forced to read data from many small WAL
> files. (3) The data in WAL files could be retained much longer than
> retention time. Since the small WAL files aren't completely deleted until
> all partitions' data in it are obsolete, the deletion of the WAL files
> could be delayed by hours or more. If the WAL file includes a partition
> with a low retention time, the retention contract could be violated
> significantly.
> The earlier design of the KIP included a separate object
> merging process that combines small WAL files much more aggressively than
> tiered storage, which seems to be a much better choice.
>
> JR2. I don't think the current default partition assignment strategy for
> classic topics works for diskless topics. Current strategy tries to spread
> the replicas to as many brokers as possible. For example, if a broker has
> 100 partitions, their replicas could be spread over 100 brokers. If the
> broker generates a WAL file with 100 partitions, this WAL file will be read
> 100 times, once by each broker. S3 read cost is 1/12 of the cost of S3 put.
> This assignment strategy will increase the S3 cost by about 8X, which is
> prohibitive. We need to design a cost effective assignment strategy for
> diskless topics.
>
> JR3. We need to think through the leade election logic with diskless topic.
> The KIP tries to reuse the ISR logic for class topic, but it doesn't seem
> very natural.
> JR3.1 In classsic topic, the leader is always in ISR. In the diskless
> topic, the KIP says that a leader could be out of sync.
> JR3.2 The existing leader election logic based on ISR/ELR mainly retries to
> preserve previously acknowledged data. With diskless topics, since the
> object store provides durability, this logic seems no longer needed. The
> existing min.isr and unclean leader election logic also don't apply.
>
> JR4. "Despite that there is no inter-broker replication, replicas will
> still issue FetchRequest to leaders. Leaders will respond with empty (no
> records) FetchResponse."
> This seems unnatural. Could we avoid issuing inter broker fetch requests
> for diskless topics?
>
> JR5. "The replica reassignment will follow the same flow as in classic
> topic:".
> JR5.1 Is this true? Since inter broker fetch response is alway empty, it
> doesn't seem the current reassignment flow works for diskless topic. Also,
> since the source of the data is object store, it seems more natural for a
> replica to back fill the data from the object store, instead of other
> replicas. This will also incur lower costs.
> JR5.2 How do we prevent reassignment on diskless topics from causing the
> same cost issue described in JR2?
>
> JR6. "In other functional aspects, diskless topics are indistinguishable
> from classic topics. This includes durability guarantees, ordering
> guarantees, transactional and non-transactional producer API, consumer API,
> consumer groups, share groups, data retention (deletion & compact),"
> JR6.1 Could you describe how compact diskless topics are supported?
> JR6.2 Neither this KIP nor KIP-1164 describes the transactional support in
> detail.
>
> JR7. "Object Storage: A shared, durable, concurrent, and eventually
> consistent storage supporting arbitrary sized byte values and a minimal set
> of atomic operations: put, delete, list, and ranged get."
> It seems that the object storage in all three major public clouds are
> strongly consistent.
>
> Jun
>
> On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]> wrote:
>
> > Hi all,
> >
> > The parent KIP-1150 was voted for and accepted. Let's now focus on the
> > technical details presented in this KIP-1163 and also in KIP-1164:
> > Diskless Coordinator [1].
> >
> > Best,
> > Ivan
> >
> > [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
> >
> > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
> > > Hi all!
> > >
> > > We want to start the discussion thread for KIP-1163: Diskless Core [1],
> > > which is a sub-KIP for KIP-1150 [2].
> > >
> > > Let's use the main KIP-1150 discuss thread [3] for high-level questions,
> > > motivation, and general direction of the feature and this thread for
> > > particular details of implementation.
> > >
> > > Best,
> > > Ivan
> > >
> > > [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> > > [2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > [3] https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
