Hi All,

I would tie JR1 and JR11 together.
From Jun: "By 'the first approach', do you mean aggressive tiering with faster segment rolling through the existing RLMM? I don't think the existing RLMM is designed to solve these issues due to inefficiencies in cost, metadata propagation and metadata storage as we previously discussed."

From Satish: "RLMM was not designed for aggressive copying of the latest data to tiered storage by having small segment rollouts."

From Luke: "I personally quite like the idea of delegating the tiny objects merging task to tiered storage. Sadly, there are some drawbacks that Jun pointed out. I agree that if we are using the aggressive tiering object solution, it might de-prioritize or delay progress of the classic tiered storage topics."

Sorry, I realize now that "aggressive tiering" was confusing wording: I meant solution (A) in my previous email. I was just saying that if we can decouple RLMM from diskless by using classic local logs to cache segments, then we should be able to approximate the 87.5% cost saving target relatively well and create a bridge between diskless and tiered logs. I'm not saying this is the best solution, because the RLMM bottleneck would still exist, but it is an option and I think it would be a good basis for an improvement that fixes these shortcomings. My reasons are the following:

(a) Using the tiered storage framework has the advantage that existing integrations would fit into the diskless framework, and it would also be possible to switch between topic types: a classic topic could be reconfigured to have a diskless head, and vice versa. This gives the project great flexibility and compatibility with the existing features. Separating diskless storage entirely, without data being able to cross this border, would ultimately create a competing logging layer inside the project, which may not be beneficial in the long term.
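As a quick cross-check of the 87.5% target mentioned above, here is the arithmetic from Jun's cost comparison later in this thread, written out as a small sketch (the prices are the AWS list prices quoted there; the 4 MB/sec broker and the PUT rates are Jun's example figures, not measurements):

```python
# Sketch of the replication-vs-S3-PUT cost comparison from this thread.
# Prices: cross-AZ transfer $0.02/GB, S3 PUT $0.005 per 1000 requests.
NETWORK_COST_PER_MB = 2e-5  # dollars per MB transferred
S3_PUT_COST = 0.5e-5        # dollars per PUT request

def savings(mb_per_sec: float, puts_per_sec: float) -> float:
    """Fraction of cost saved when mb_per_sec of replication traffic
    is replaced by puts_per_sec S3 PUT requests."""
    network_cost = mb_per_sec * NETWORK_COST_PER_MB
    put_cost = puts_per_sec * S3_PUT_COST
    return (network_cost - put_cost) / network_cost

# Jun's example: a broker producing 4 MB/sec.
print(f"{savings(4, 2):.1%}")  # 2 PUTs/sec (the diskless target)
print(f"{savings(4, 6):.1%}")  # 6 PUTs/sec (15-minute tiered rolls added)
```

With Jun's numbers this prints 87.5% and 62.5%, matching the figures quoted later in the thread.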
(b) We can manage the RLMM weakness in two ways: (i) improve the RLMM with snapshotting so it handles smaller log files better; (ii) merge tiered storage segments with UploadPartCopy-like features, or by concatenating them "on the fly" without using any disk and with minimal RAM (an UploadPartCopy typically has the same cost as a PUT). Index files need to be adjusted, though.

(c) Cost-wise it seems very similar to diskless merging, while having the advantages above.

Compared to this, WAL merging, although it might be marginally cheaper, creates a competing log layer with no easy crossing between it and classic logs, and it also won't be able to create optimal logs, as merged segments would be mixed (if we just assume a concatenation merging strategy).

I wouldn't do both solutions though. I agree with Luke that one of them should ideally be enough to achieve the read optimization goal, although I can see that if we go with WAL merging, the need to cross between these logging forms may appear in the future, and we may get that relatively cheaply by improving RLMM to handle this traffic.

Thanks,
Viktor

On Fri, May 15, 2026 at 5:52 AM Luke Chen <[email protected]> wrote:
> Hi Greg,
>
> I personally quite like the idea of delegating the tiny objects merging
> task to tiered storage.
> Sadly, there are some drawbacks that Jun pointed out.
> I agree that if we are using the aggressive tiering object solution, it
> might de-prioritize or delay progress of the classic tiered storage
> topics.
>
> > We can build the merging step to optimize WAL segments for more
> > predictable rebuild times. But could we still perform a final move to
> > Tiered Storage after each partition reaches the configured roll times?
>
> I think you have your imagined use cases in the future.
> But it doesn't make sense when you finally merge 500 tiny objects into
> one big WAL segment, then you get rid of it and upload another copy of
> the log segment onto remote storage via tiered storage.
> Maybe you can consider directly appending new metadata into RLMM to
> point to the location of the merged WAL segments?
>
> Thank you,
> Luke
>
> On Fri, May 15, 2026 at 5:11 AM Greg Harris via dev <[email protected]> wrote:
> > Jun & Satish,
> >
> > We can build the merging step to optimize WAL segments for more
> > predictable rebuild times. But could we still perform a final move to
> > Tiered Storage after each partition reaches the configured roll times?
> > We could expect the same load/sizing expectations as classic topics
> > (e.g. >1gb segments).
> >
> > We are interested in unifying with Tiered Storage for many reasons,
> > but also so that topics which have diskless mode dynamically
> > enabled/disabled can eventually converge to a predictable state.
> >
> > Thanks,
> > Greg
> >
> > On Wed, May 13, 2026, 3:56 AM Satish Duggana <[email protected]> wrote:
> > > RLMM was not designed for aggressive copying of the latest data to
> > > tiered storage by having small segment rollouts.
> > >
> > > +1 to Jun on leaving the existing RLMM for classic topics with
> > > tiered storage and having an efficient metadata management system
> > > required for diskless topics.
> > >
> > > On Tue, 12 May 2026 at 23:59, Jun Rao via dev <[email protected]> wrote:
> > > > Hi, Viktor,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > JR1. (A) and (B) Yes, your summary matches my thinking.
> > > > (C) "Generally I think that (i) (ii) (iii) and (iv) may be
> > > > addressed with an aggressive tiered storage consolidation (the
> > > > first approach)".
> > > > Hmm, I am confused by the above statement. By "the first
> > > > approach", do you mean aggressive tiering with faster segment
> > > > rolling through the existing RLMM? I don't think the existing
> > > > RLMM is designed to solve these issues due to inefficiencies in
> > > > cost, metadata propagation and metadata storage as we previously
> > > > discussed.
> > > > JR11. I was thinking we leave the existing RLMM as is and
> > > > continue to use it for classic topics. We design a new, more
> > > > efficient metadata management component independent of RLMM. This
> > > > new component will be the only metadata component that diskless
> > > > topics depend on.
> > > >
> > > > Jun
> > > >
> > > > On Tue, May 12, 2026 at 8:43 AM Viktor Somogyi-Vass <[email protected]> wrote:
> > > > > Hi Jun,
> > > > >
> > > > > JR1
> > > > > (1)-(2)-(3) I'd address these together; let me explain our
> > > > > current idea to solve the tiny object problem, because I'm not
> > > > > sure if we're 100% talking about the same thing. I have two
> > > > > approaches in mind for TS consolidation ((A) and (B)) and I'm
> > > > > not sure if we're both assuming the same idea, so let's clarify
> > > > > this.
> > > > >
> > > > > (A)
> > > > > This is our current assumption. This uses local disks (create
> > > > > classic local logs with UnifiedLog) to consolidate logs into
> > > > > the classic log format and use RSM and RLMM to store them in
> > > > > tiered storage. This way we're not limited by the need to have
> > > > > short rollovers. Local logs become a form of staging
> > > > > environment to serve reads and accumulate records for tiered
> > > > > storage. This means that:
> > > > > (a) Once a message is consolidated into the classic log format,
> > > > > we can use it for serving lagging consumers. Diskless reads
> > > > > should really be used for the head of the log, and after a few
> > > > > seconds logs should be consolidated.
> > > > > (b) The real cost is much closer to that 87.5% (and in fact the
> > > > > google sheet I shared also assumes this model) because we have
> > > > > more freedom in choosing the retention parameters of the
> > > > > classic log.
> > > > > (c) Metadata is smaller, as we only need to keep diskless
> > > > > segments until the tiered offset surpasses the individual
> > > > > batches' offsets.
> > > > > (d) RLMM metadata is also somewhat manageable due to the larger
> > > > > segment sizes, but it's still possible to run into the metadata
> > > > > explosion problem.
> > > > > (e) It needs to rebuild this local log on reassignment to serve
> > > > > lagging consumers effectively, so reassignment is a bit more
> > > > > messy.
> > > > > (f) It's not optimal when partitions have a single replica: on
> > > > > failure we can only fall back to diskless mode until the
> > > > > partition is reassigned to a functioning broker.
> > > > >
> > > > > (B)
> > > > > Compared to the above there can be an alternative approach,
> > > > > which is to consolidate when diskless segments expire (after 15
> > > > > minutes for instance). In that case your points seem to fit
> > > > > better as:
> > > > > (a) we can only use the classic, consolidated logs to serve
> > > > > lagging consumers after they have been tiered
> > > > > (b) to be more efficient with lagging consumers we have to
> > > > > stick to a short rollover
> > > > > (c) it's more costly due to the short rollovers
> > > > > (d) the RLMM bottleneck still exists due to the short rollovers
> > > > > (e) it's not a given whether we use local disks for
> > > > > transforming logs, as we can do it in memory too (which can be
> > > > > inefficient and more expensive), but perhaps the "chunked
> > > > > transfer encoding" that S3 supports, or something similar with
> > > > > other providers, is a cost effective way. If we know the final
> > > > > size in advance, we can upload the data in chunks and still get
> > > > > billed for 1 put.
> > > > > (f) reassignment or failover is cleaner and faster, as there
> > > > > isn't a need to rebuild local caches.
> > > > > (C)
> > > > > Apart from the first 2 approaches there is a 3rd, which is WAL
> > > > > merging. To understand your points, let me summarize what I
> > > > > could gather so far as reasons for WAL merging (and please
> > > > > correct me if I missed something):
> > > > > (i) protecting consumer lag: small WAL files create inefficient
> > > > > objects for lagging consumers, so larger objects should be more
> > > > > efficient
> > > > > (ii) avoiding the RLMM replay bottleneck: managing small
> > > > > segments with RLMM is very inefficient (100s of GB of metadata)
> > > > > (iii) reducing batch metadata overhead: merging WAL files may
> > > > > reduce the metadata we need to store, but it depends on the
> > > > > merge algorithm and how we can compact batch data
> > > > > (iv) cost effectiveness: retrieving merged WAL files reduces
> > > > > the number of get requests to object storage
> > > > > (v) architectural redundancy with RLMM: ideally we wouldn't
> > > > > need 2 solutions to 2 somewhat similar problems (tiered storage
> > > > > and diskless)
> > > > >
> > > > > Generally I think that (i) (ii) (iii) and (iv) may be addressed
> > > > > with an aggressive tiered storage consolidation (the first
> > > > > approach), so the only remaining gap would be (v). I also agree
> > > > > that having 2 different solutions for metadata handling isn't
> > > > > ideal and perhaps there is a possibility of improvement here.
> > > > > It should be possible to redesign RLMM to be more similar to
> > > > > the diskless coordinator, or to design a common solution.
> > > > >
> > > > > JR11
> > > > > "If we support merging in the diskless coordinator, I wonder
> > > > > how useful RLMM is. It seems simpler to manage all metadata
> > > > > from the object store in a single place."
> > > > >
> > > > > Could you please clarify this a little bit?
> > > > > Do you think that we should replace the RLMM with a solution
> > > > > that is more similar to the diskless coordinator, or deprecate
> > > > > tiered storage altogether in favor of diskless? I'm not sure
> > > > > which option you're referring to:
> > > > > (1) Unify tiered storage and diskless under a single storage
> > > > > layer (and possibly deprecate tiered storage in favor of
> > > > > diskless with merging WAL segments).
> > > > > (2) Create a smart coordinator instead of RLMM and possibly
> > > > > unify metadata coordination with diskless.
> > > > > (3) Keep tiered storage and diskless separate with their own
> > > > > solutions for metadata (probably not optimal).
> > > > >
> > > > > Thanks,
> > > > > Viktor
> > > > >
> > > > > On Fri, May 1, 2026 at 9:08 PM Jun Rao via dev <[email protected]> wrote:
> > > > > > Hi, Viktor and Greg,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > JR1.
> > > > > > 1) Thanks for verifying the cost estimation. I noticed a bug
> > > > > > in my earlier calculation. I estimated the per broker network
> > > > > > transfer rate at 2MB/sec. It should be 4MB/sec. If I correct
> > > > > > it, the estimated savings are similar to yours.
> > > > > > The cost for transferring 4MB through the network is
> > > > > > 4 * 2 * 10^-5 = $8 * 10^-5.
> > > > > > If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The
> > > > > > savings are about 87.5%.
> > > > > > If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The
> > > > > > savings are 62.5%.
> > > > > > Savings are still significantly lower when using RLMM.
> > > > > >
> > > > > > "To me it seems like Greg's previous suggestion for a 15 min
> > > > > > rollover may be a bit too much. With 1 hour we can achieve
> > > > > > better cost saving and less coordinator metadata being
> > > > > > stored."
> > > > > > This solves the cost issue, but it has other implications
> > > > > > (see point 2) below).
> > > > > >
> > > > > > 2) "Yes, I think this is to be expected and a lot depends on
> > > > > > the implementation. Ideally segments or chunks should be
> > > > > > cached to minimize the number of times segments are pulled
> > > > > > from remote storage."
> > > > > > In a classic topic, when a consumer lags, its requests are
> > > > > > served either from the local cache or from large objects in
> > > > > > the object store. With the current design in a diskless
> > > > > > topic, lagging consumer requests might be served from tiny
> > > > > > 500-byte objects. This will significantly slow down the
> > > > > > consumer's catch-up, which is not expected user behavior.
> > > > > > Ideally, we don't want those tiny objects to last more than a
> > > > > > few minutes, let alone an hour.
> > > > > >
> > > > > > 3) "I think if my calculations are correct (and we use a 60
> > > > > > minute window), then metadata generation should be slower,
> > > > > > please see the google sheet I linked above. I think given
> > > > > > that traffic, the current topic based RLMM should be able to
> > > > > > handle it."
> > > > > > Why is a 60 minute window used? RLMM metadata needs to be
> > > > > > retained for the longest retention time among all topics.
> > > > > > This means that the retention window can be weeks instead of
> > > > > > 1 hour. This means that RLMM might need to replay over 100GB
> > > > > > of data during reassignment, which is not what it is designed
> > > > > > for.
> > > > > >
> > > > > > JR10. "Your example of 100,000 1kb/s partitions is a
> > > > > > borderline case, where there are some configurations which
> > > > > > are not viable due to scale or cost, and some that are.
> > > > > > It would be up to the operator to tune their cluster, by
> > > > > > changing diskless.segment.ms, dividing up the cluster, or
> > > > > > switching to a more scalable RLMM implementation."
> > > > > > A broker with 4MB/sec produce throughput can probably be
> > > > > > considered high throughput. Even with 4K partitions per
> > > > > > broker, we could still achieve an 87.5% cost saving as listed
> > > > > > above, if we do the right implementation. So, ideally, it
> > > > > > would be useful to support that as well.
> > > > > >
> > > > > > JR11. "We had a short conversation with Greg and we came to
> > > > > > the conclusion that because of the explosiveness of diskless
> > > > > > metadata, it may be worth revisiting the merging case as it
> > > > > > can indeed buy us some more cost saving for the added
> > > > > > complexity."
> > > > > > If we support merging in the diskless coordinator, I wonder
> > > > > > how useful RLMM is. It seems simpler to manage all metadata
> > > > > > from the object store in a single place.
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Mon, Apr 27, 2026 at 4:17 PM Greg Harris <[email protected]> wrote:
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > Thank you for scrutinizing the scalability of the current
> > > > > > > direct-to-tiered-storage strategy, and its metadata
> > > > > > > scalability.
> > > > > > > One of our implicit assumptions with this design was that
> > > > > > > users are able to choose between the Diskless and Classic
> > > > > > > mechanisms, and that in any situation where the Diskless
> > > > > > > design was deficient, the Classic topics could continue to
> > > > > > > be used.
> > > > > > > This was originally applied to low-latency use-cases, but
> > > > > > > now it also applies to low-throughput use-cases. When the
> > > > > > > throughput on a topic is low, the benefit of using Diskless
> > > > > > > is also low, because it is proportional to the amount of
> > > > > > > data transferred, and it is more likely that the batch
> > > > > > > overhead of the topics is significant.
> > > > > > > In other words, we've been treating cost-effective support
> > > > > > > for arbitrarily low throughput topics as a non-goal.
> > > > > > >
> > > > > > > Your example of 100,000 1kb/s partitions is a borderline
> > > > > > > case, where there are some configurations which are not
> > > > > > > viable due to scale or cost, and some that are. It would be
> > > > > > > up to the operator to tune their cluster, by changing
> > > > > > > diskless.segment.ms, dividing up the cluster, or switching
> > > > > > > to a more scalable RLMM implementation.
> > > > > > >
> > > > > > > Do you think we should have cost-effective support for
> > > > > > > arbitrarily low-throughput partitions in Diskless?
> > > > > > > How much total demand is there in partitions where batches
> > > > > > > are >1kb but the partition throughput is <1kb/s?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Greg
> > > > > > >
> > > > > > > On Fri, Apr 24, 2026 at 10:23 AM Viktor Somogyi-Vass <[email protected]> wrote:
> > > > > > > > Hi Jun,
> > > > > > > >
> > > > > > > > Regarding JR1.
> > > > > > > > We had a short conversation with Greg and we came to the
> > > > > > > > conclusion that because of the explosiveness of diskless
> > > > > > > > metadata, it may be worth revisiting the merging case as
> > > > > > > > it can indeed buy us some more cost saving for the added
> > > > > > > > complexity. Also, it would support smaller topics and we
> > > > > > > > could somewhat manage the tiered storage consolidation
> > > > > > > > costs. I think that we would still need to consolidate
> > > > > > > > WAL segments into tiered storage. Reasons are: to limit
> > > > > > > > WAL metadata, to be able to dynamically enable/disable
> > > > > > > > diskless, and to be compatible with existing and future
> > > > > > > > TS improvements.
> > > > > > > > I'll try to refresh KIP-1165 and build it into the
> > > > > > > > calculator above (if it's possible at all :) ) and come
> > > > > > > > back to you.
> > > > > > > > Regardless, I just wanted to give a short update in the
> > > > > > > > meantime, looking forward to your answer.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Viktor
> > > > > > > >
> > > > > > > > On Fri, Apr 24, 2026 at 3:46 PM Viktor Somogyi-Vass <[email protected]> wrote:
> > > > > > > > > Hi Jun,
> > > > > > > > >
> > > > > > > > > Thanks for the quick reply.
> > > > > > > > >
> > > > > > > > > JR1.
> > > > > > > > > 1) Thanks for putting the numbers together.
> > > > > > > > > While your calculation seems to be correct in the
> > > > > > > > > sense that 6 PUTs would worsen the cost saving
> > > > > > > > > benefits, I think that in a byte for byte comparison
> > > > > > > > > there is a bigger difference. The reason is that the 4
> > > > > > > > > tiered storage puts transfer much more data compared to
> > > > > > > > > the small WAL segments, so in practice there should be
> > > > > > > > > fewer TS puts.
> > > > > > > > > I made a google sheet calculator for this which I'd
> > > > > > > > > like to share with you:
> > > > > > > > > https://docs.google.com/spreadsheets/d/127GOTWfFSN27B5ezif14GPj8KtrghjBqsXG9GG6NxhI/edit?gid=749470906#gid=749470906
> > > > > > > > > Please copy the sheet to modify the values.
> > > > > > > > > About my findings: I was trying to create a cluster
> > > > > > > > > model similar to what has been discussed here
> > > > > > > > > previously, to see how cost varies over different
> > > > > > > > > segment rollovers. To me it seems like Greg's previous
> > > > > > > > > suggestion for a 15 min rollover may be a bit too much.
> > > > > > > > > With 1 hour we can achieve better cost saving and less
> > > > > > > > > coordinator metadata being stored.
> > > > > > > > > I have also tried to account for the producer batch
> > > > > > > > > metadata generated by diskless partitions, but to me it
> > > > > > > > > seems like a lower number than Greg's original numbers.
> > > > > > > > >
> > > > > > > > > 2) "Note that local storage could be lost on reassigned
> > > > > > > > > partitions. In that case, lagging reads can only be
> > > > > > > > > served from the object store."
> > > > > > > > > Yes, I think this is to be expected and a lot depends
> > > > > > > > > on the implementation. Ideally segments or chunks
> > > > > > > > > should be cached to minimize the number of times
> > > > > > > > > segments are pulled from remote storage.
> > > > > > > > >
> > > > > > > > > "The 2MB/sec I quoted is for a specific broker.
> > > > > > > > > Depending on the broker instance type, a broker may
> > > > > > > > > only be able to handle low 10s of MB/sec of data. So,
> > > > > > > > > 2MB/sec overhead is significant."
> > > > > > > > > Yes, I had indeed misunderstood, however I have updated
> > > > > > > > > my calculator sheet with the metadata calculation.
> > > > > > > > > Overall, the number of tiered storage segments created
> > > > > > > > > seems to be much lower than in your calculations, given
> > > > > > > > > the parameters of the cluster you specified earlier.
> > > > > > > > > Please take a look, I'd really like to understand the
> > > > > > > > > thinking here because this is a crucial point.
> > > > > > > > >
> > > > > > > > > 3) I think if my calculations are correct (and we use a
> > > > > > > > > 60 minute window), then metadata generation should be
> > > > > > > > > slower, please see the google sheet I linked above. I
> > > > > > > > > think given that traffic, the current topic based RLMM
> > > > > > > > > should be able to handle it.
> > > > > > > > > In the case where we would need to make the RLMM
> > > > > > > > > capable of handling traffic similar to the diskless
> > > > > > > > > coordinator's, then you're right, we probably should
> > > > > > > > > consider how we can improve it. I think there are
> > > > > > > > > multiple possibilities as you mentioned, but ideally
> > > > > > > > > there should be a common implementation for metadata
> > > > > > > > > coordination that could handle these cases.
> > > > > > > > >
> > > > > > > > > JR7.
> > > > > > > > > Yes, your expectation is totally reasonable, we should
> > > > > > > > > expect the get and put operations to be strongly
> > > > > > > > > consistent for the read-after-write scenarios. And I
> > > > > > > > > think that since major cloud providers offer strongly
> > > > > > > > > consistent object storage, it should be sufficient for
> > > > > > > > > a wide user-group. So we could shrink the scope of the
> > > > > > > > > KIP a bit this way and avoid adding complexity that is
> > > > > > > > > needed mostly on the margin.
> > > > > > > > > I expect though that "list" can stay eventually
> > > > > > > > > consistent, as the KIP relies on it only for garbage
> > > > > > > > > collection, where it is fine if a few segments are only
> > > > > > > > > collected in the next iteration.
> > > > > > > > >
> > > > > > > > > JR3.
> > > > > > > > > Since Greg hasn't replied yet, I'll try to catch up
> > > > > > > > > with him and formulate an answer next week.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Viktor
> > > > > > > > >
> > > > > > > > > On Tue, Apr 21, 2026 at 8:16 PM Jun Rao via dev <[email protected]> wrote:
> > > > > > > > > > Hi, Viktor,
> > > > > > > > > >
> > > > > > > > > > Thanks for the reply.
> > > > > > > > > >
> > > > > > > > > > JR1.
> > > > > > > > > > 1) "So while it seems to be significant that we
> > > > > > > > > > tripled the number of PUTs, cost-wise it doesn't seem
> > > > > > > > > > to be significant."
> > > > > > > > > > Let's compare the savings achieved by replacing
> > > > > > > > > > network replication transfer with S3 puts in AWS.
> > > > > > > > > > network transfer cost: $0.02/GB = $2 * 10^-5/MB
> > > > > > > > > > S3 put cost: $0.005 per 1000 requests =
> > > > > > > > > > $0.5 * 10^-5/request
> > > > > > > > > >
> > > > > > > > > > The KIP batches data up to 4MB. So, let's assume that
> > > > > > > > > > we write 2MB S3 objects on average.
> > > > > > > > > >
> > > > > > > > > > The cost for transferring 2MB through the network is
> > > > > > > > > > 2 * 2 * 10^-5 = $4 * 10^-5.
> > > > > > > > > > If it's replaced with 2 S3 puts, the cost is
> > > > > > > > > > $1 * 10^-5. The savings are about 75%.
> > > > > > > > > > If it's replaced with 6 S3 puts, the cost is
> > > > > > > > > > $3 * 10^-5. The savings are 25%. As you can see, the
> > > > > > > > > > savings are significantly lower.
> > > > > > > > > >
> > > > > > > > > > 2) "Therefore we could expect classic local segments
> > > > > > > > > > to be present which could be used for catching up
> > > > > > > > > > consumers."
> > > > > > > > > > Note that local storage could be lost on reassigned
> > > > > > > > > > partitions. In that case, lagging reads can only be
> > > > > > > > > > served from the object store.
> > > > > > > > > >
> > > > > > > > > > "Regarding the amount of metadata: 2MB/sec is well
> > > > > > > > > > below the 2GB/s throughput that Greg calculated
> > > > > > > > > > previously, so I think it should be manageable for a
> > > > > > > > > > cluster with that amount of throughput,"
> > > > > > > > > > It seems that you didn't make the correct comparison.
> > > > > > > > > > The 2GB/s that Greg mentioned is the throughput for
> > > > > > > > > > the whole cluster. The 2MB/sec I quoted is for a
> > > > > > > > > > specific broker.
Depending on the broker instance > type, > > a > > > > >> broker > > > > >> >> may > > > > >> >> >> only be able to handle low 10s of MB/sec of data. So, > 2MB/sec > > > > >> overhead > > > > >> >> is > > > > >> >> >> significant. > > > > >> >> >> > > > > >> >> >> 3) "I'd separate it from the discussion of diskless core and > > > > >> perhaps we > > > > >> >> >> could address it in a separate KIP as it is mostly a > redesign > > > of the > > > > >> >> >> RLMM." > > > > >> >> >> Those problems don't exist in the existing usage of RLMM. > They > > > > >> manifest > > > > >> >> >> because diskless tries to use RLMM in a way it wasn't > designed > > > for > > > > >> >> (there > > > > >> >> >> is at least a 20X increase in metadata). It would be useful > to > > > > >> consider > > > > >> >> >> whether fixing those problems in RLMM or using a new > approach > > is > > > > >> >> >> better. For example, KIP-1164 already introduces a > > snapshotting > > > > >> >> mechanism. > > > > >> >> >> Adding another snapshotting mechanism to RLMM seems > redundant. > > > > >> >> >> > > > > >> >> >> JR7. A typical object store supports 3 operations: puts, > gets > > > and > > > > >> >> lists. > > > > >> >> >> Which operations used by diskless can be eventually > > consistent? > > > I'd > > > > >> >> expect > > > > >> >> >> that get should always see the result of the latest put. > > > > >> >> >> > > > > >> >> >> Jun > > > > >> >> >> > > > > >> >> >> On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass < > > > > >> [email protected] > > > > >> >> > > > > > >> >> >> wrote: > > > > >> >> >> > > > > >> >> >> > Hi Jun, > > > > >> >> >> > > > > > >> >> >> > I'd like to add my thoughts too until Greg has time to > > > respond. > > > > >> >> >> > > > > > >> >> >> > JR1. I also think there are shortcomings in the current > > tiered > > > > >> >> storage > > > > >> >> >> > design, around the RLMM. 
> > > > > > > > > > > 1) I think this is a correct observation, however
> > > > > > > > > > > if my calculations are correct, it actually comes
> > > > > > > > > > > down to a negligible amount of cost. Taking the AWS
> > > > > > > > > > > pricing sheet at
> > > > > > > > > > > https://aws.amazon.com/s3/pricing/
> > > > > > > > > > > it seems like the difference between 6 and 2 PUTs
> > > > > > > > > > > per second is ~$52 for a month. The calculation
> > > > > > > > > > > follows as: 6*60*60*24*30*0.005/1000 -
> > > > > > > > > > > 2*60*60*24*30*0.005/1000 = $51.84. So while it
> > > > > > > > > > > seems to be significant that we tripled the number
> > > > > > > > > > > of PUTs, cost-wise it doesn't seem to be
> > > > > > > > > > > significant.
> > > > > > > > > > > 2) Reflecting on your original problem: the tiered
> > > > > > > > > > > storage consolidation process should be
> > > > > > > > > > > continuously running and transforming WAL segments
> > > > > > > > > > > into classic logs. Therefore we could expect
> > > > > > > > > > > classic local segments to be present which could be
> > > > > > > > > > > used for catching up consumers.
So they would only switch to WAL reading when they're close to the end of the log. Since this offset space should be cached, the reads from there should be fast.
Regarding the amount of metadata: 2MB/sec is well below the 2GB/s throughput that Greg calculated previously, so I think it should be manageable for a cluster with that amount of throughput, although I agree with your comment that the current topic-based tiered metadata manager isn't optimal and we could develop a better solution.
3) Tied to the previous point, I agree that your comments are absolutely valid, however similarly to that, I'd separate it from the discussion of diskless core and perhaps we could address it in a separate KIP as it is mostly a redesign of the RLMM.

JR2. Ack. We will raise a KIP in the near future.

JR3. I'd leave answering this one to Greg as I don't have too much context on it.

JR7. I think this could be similar to the tiered storage design, so any coordinator operation should be strongly consistent (since we're using classic topics there). Therefore the WAL segment storage layer could be eventually consistent, as we store its metadata in a strongly consistent manner. I'm not sure though if this was the answer you were looking for?

Best,
Viktor

On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev <[email protected]> wrote:

Hi, Greg,

Thanks for the reply.

JR1. Rolling log segments every 15 minutes addresses the 3 concerns I listed, but it introduces some new issues because it doesn't quite fit the design of the current tiered storage. (a) The current tiered storage design stores a single partition per object. If we roll a log segment every 15 minutes, with 4K partitions per broker, this means an additional 4 S3 puts per second. The diskless design aims for 2 S3 puts per second. So, this triples the S3 put cost and reduces the savings benefits. (b) With tiered storage, each broker essentially needs to read the tier metadata from all tier metadata partitions if the number of user partitions exceeds 50. Assume that we generate 100 bytes of tier metadata per partition every 15 minutes, that each broker has 4K partitions, and a cluster of 500 brokers. Each broker needs to receive tier metadata at a rate of 100 * 4K * 500 / (15 * 60) = 200KB/sec. For a broker hosting one of the 50 tier metadata topic partitions, it needs to send out metadata at 100 * 4K * 500 / 50 * 500 / (15 * 60) = 2MB/sec. This increases unnecessary network and CPU overhead. (c) Tiered storage doesn't support snapshots. A restarted broker needs to replay the tier metadata log from the beginning to build the tier metadata state. Suppose that the tier metadata log is kept for 7 days. The total amount of tier metadata that needs to be replayed is 200KB * 7 * 24 * 3600 = 120GB.
Does the merging optimization you mentioned address those new concerns? If so, could you describe how it works?

JR2. It's fine to cover the default partition assignment strategy for diskless topics in a separate KIP. However, since this is essential for achieving the cost saving goal, we need a solution before releasing the diskless KIP.

JR3. Sounds good. Could you document how this works?

JR7. Could you describe which parts of the operation can be eventually consistent?
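Jun's tier-metadata overhead figures above can be reproduced with a small Python sketch; the constants are the assumed figures from the email (100 bytes per partition per 15-minute roll, 4K partitions per broker, 500 brokers, 50 tier metadata topic partitions). Note the exact values come out slightly above the email's rounded 200KB/s, 2MB/s, and 120GB:

```python
# Tier-metadata traffic estimate for aggressive 15-minute segment rolls.
META_BYTES = 100               # metadata bytes per partition per roll
PARTITIONS_PER_BROKER = 4_000
BROKERS = 500
ROLL_SECONDS = 15 * 60
META_TOPIC_PARTITIONS = 50

# Every broker consumes the whole cluster's tier metadata stream.
recv_rate = META_BYTES * PARTITIONS_PER_BROKER * BROKERS / ROLL_SECONDS
# A broker hosting one metadata partition fans its share out to all 500 brokers.
send_rate = recv_rate / META_TOPIC_PARTITIONS * BROKERS
# Replaying a 7-day metadata log on broker restart:
replay_bytes = recv_rate * 7 * 24 * 3600

print(f"receive ~{recv_rate / 1e3:.0f}KB/s, send ~{send_rate / 1e6:.1f}MB/s, "
      f"replay ~{replay_bytes / 1e9:.0f}GB")  # ~222KB/s, ~2.2MB/s, ~134GB
```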
Jun

On Thu, Mar 19, 2026 at 1:35 PM Greg Harris <[email protected]> wrote:

Hi Jun,

Thanks for your comments!

JR1:
You are correct that the segment rolling configurations are currently critical to balance the scalability of Diskless and Tiered Storage, as larger roll configurations benefit tiered storage, and smaller roll configurations benefit Diskless.

To address your points specifically:
(1) A Diskless topic which is cost-competitive with an equivalent Classic topic will have a metadata size <1% of the data size. A cluster storing 360GB of metadata will have >36TB of data under management, and a retention of 5hr implies a throughput of >2GB/s. This will require multiple Diskless coordinators, which can share the load of storing the Diskless metadata and serving Diskless requests.
(2) Catching-up consumers are intended to be served from tiered storage and local segment caches. Brokers which are building their local segment caches will have to read many files, but will amortize those reads by receiving data for multiple partitions in a single read.
(3) This is a fundamental downside of storing data from multiple topics in a single object, similar to classic segments. We can implement a configurable cluster-wide maximum roll time, which would set the slowest cadence at which Tiered Storage segments are rolled from Diskless segments. If an individual partition has more aggressive roll settings, it may be rolled earlier.
This configuration would permit the cluster operator to approximately bound the number of diskless WAL segments, which bounds the total size of the WAL segments, disk cache, diskless coordinator state, and excessive retention window. For example, a diskless.segment.ms of 15 minutes would reduce the metadata storage to 18GB, WAL segments to 1.8TB, and permit short-retention data to be physically deleted as soon as ~15 minutes after being produced.
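Greg's 18GB/1.8TB figures follow from scaling the earlier JR1 estimates (360GB of metadata, >36TB of data at a 5-hour roll) linearly down to a 15-minute roll; a Python sketch of that scaling, under the assumption of constant per-partition throughput:

```python
# Shrinking the outstanding WAL/metadata footprint by rolling every 15 minutes
# instead of every 5 hours, assuming both scale linearly with the roll interval.
BASELINE_ROLL_HOURS = 5
NEW_ROLL_HOURS = 0.25          # 15 minutes
scale = NEW_ROLL_HOURS / BASELINE_ROLL_HOURS  # 1/20

metadata_gb = 360 * scale      # coordinator metadata: 360GB -> 18GB
wal_tb = 36 * scale            # outstanding WAL segments: 36TB -> 1.8TB
print(f"metadata ~{metadata_gb:.0f}GB, WAL segments ~{wal_tb:.1f}TB")
```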
Of course, this will reduce the size of the tiered storage segments for topics that have low throughput and where segment.ms > diskless.segment.ms, increasing overhead in the RLMM. We can perform merging/optimization of Tiered Storage segments to achieve the per-topic segment.ms.
There were some reasons why we retracted the prior file-merging approach, and why merging in tiered storage appears better:
* Rewriting files requires mutability for existing data, which adds complexity. Diskless batches or Remote Log Segments would need to be made mutable, and the remote log will be made mutable in KIP-1272 [1].
* Because a WAL Segment can contain batches from multiple Diskless Coordinators, multiple coordinators must also be involved in the merging step. The Tiered Storage design has exclusive ownership for remote log segments within the RLMM.
* Diskless file merging competes for resources with latency-sensitive producers and hot consumers. Tiered storage file merging competes for resources with lagging consumers, which are typically less latency-sensitive.
* Implementing merging in Tiered Storage allows this optimization to benefit both classic topics and diskless topics, covering both high and low throughput partitions.
* Remote log segments may be optimized over much longer time windows, rather than performing optimization once in the first few hours of the life of a WAL segment and then freezing the arrangement of the data until it is deleted.
* File merging will need to rely on heuristics, which should be configurable by the user. Multi-partition heuristics are more complicated to describe and reason about than single-partition heuristics.
What do you think of this alternative?

JR2:
Yes, the current default partition assignment strategy will need some improvement. This problem with Diskless WAL segments is analogous to the Classic topics' dense inter-broker connection graph.
The natural solution to this seems to be some sort of cellular design, where the replica placements tend to locate partitions in similar groups. Partitions in the same cell can generally share the same WAL Segments and the same Diskless Coordinator requests. This would also benefit Classic topics, which would need fewer connections and fetch requests.
Such a feature is out of scope of this KIP, and either we will publish a follow-up KIP, or let operators and community tooling address this.

JR3:
Yes, we will replace the ISR/ELR election logic for diskless topics, as they no longer rely on replicas for data integrity. We will fully model the state/lifecycle of the diskless replicas in KRaft, and choose how we display this to clients.
For backwards compatibility, clients using older metadata requests should see diskless topics, but interpret them as classic topics. We could tell older clients that the leader is in the ISR, even if it just started building its cache.
For clients using the latest metadata, they should see the true state of the diskless partition: which nodes can accept produce/fetch/sharefetch requests, which ranges of offsets are cached on-broker, etc. This could also be used to break apart the "leader" field into more granular fields, now that leadership has changed meaning.

JR4:
Yes, we can replace the empty fetch requests to the leader nodes with cache hint fields in the requests to the Diskless Coordinator, and rely on the coordinator to distribute cache hints to all replicas. This should be low-overhead, and eliminate the inter-broker communication for brokers which only host Diskless topics.

JR5.1:
You are correct and this text was ambiguous, only specifying that the controller waits for the sync to be complete. This section is now updated to explicitly say that local segments are built from object storage.

JR5.2:
Extending the JR2 discussion, reassignment of diskless topics would generally happen within a cell, where the marginal cost of reading an additional partition is very low. When cells are re-balanced and a partition is migrated between cells, there is a brief time (until the next Tiered Storage segment roll) when the marginal cost is doubled. This should be infrequent and well-amortized by other topics which aren't being re-balanced between cells.

JR6.1:
We plan to move data from Diskless to Tiered Storage. Once the data is in Tiered Storage, it can be compacted using the functionality described in KIP-1272 [1].

JR6.2:
We will add details for this soon.

JR7:
We specify the requirement of eventual consistency to allow Diskless Topics to be used with other object storage implementations which aren't the three major public clouds, such as self-managed software or weaker consistency caches.

Thanks,
Greg

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage

On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <[email protected]> wrote:

Hi, Ivan,

Thanks for the KIP. A few comments below.

JR1. I am concerned about the usage of the current tiered storage to control the number of small WAL files. Current tiered storage only tiers the data when a segment rolls, which can take hours. This causes three problems. (1) Much more metadata needs to be stored and maintained, which increases the cost.
Suppose that each segment rolls every 5 hours, each partition generates 2 WAL files per second, and each WAL file's metadata takes 100 bytes. Each partition will generate 5 * 3.6K * 2 * 100 = 3.6MB of metadata. In a cluster with 100K partitions, this translates to 360GB of metadata stored on the diskless coordinators. (2) A catching-up consumer's performance degrades since it's forced to read data from many small WAL files. (3) The data in WAL files could be retained much longer than the retention time. Since the small WAL files aren't completely deleted until all partitions' data in them is obsolete, the deletion of the WAL files could be delayed by hours or more. If the WAL file includes a partition with a low retention time, the retention contract could be violated significantly. The earlier design of the KIP included a separate object merging process that combines small WAL files much more aggressively than tiered storage, which seems to be a much better choice.

JR2. I don't think the current default partition assignment strategy for classic topics works for diskless topics.
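The JR1(1) metadata estimate above can be sketched in Python; all constants are the assumed figures from the email (5-hour roll, 2 WAL files per partition per second, 100 bytes of metadata per WAL file, 100K partitions):

```python
# Coordinator metadata accumulated between segment rolls.
ROLL_SECONDS = 5 * 3600        # the "5 * 3.6K" shorthand in the email
WAL_FILES_PER_SEC = 2          # WAL files produced per partition per second
META_BYTES_PER_FILE = 100      # metadata bytes per WAL file
PARTITIONS = 100_000           # cluster-wide partition count

per_partition = ROLL_SECONDS * WAL_FILES_PER_SEC * META_BYTES_PER_FILE
cluster_total = per_partition * PARTITIONS
print(f"{per_partition / 1e6:.1f}MB per partition, "
      f"{cluster_total / 1e9:.0f}GB cluster-wide")  # 3.6MB, 360GB
```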
The current strategy tries to spread the replicas over as many brokers as possible. For example, if a broker has 100 partitions, their replicas could be spread over 100 brokers. If the broker generates a WAL file with 100 partitions, this WAL file will be read 100 times, once by each broker. S3 read cost is 1/12 of the cost of an S3 put. This assignment strategy will increase the S3 cost by about 8X, which is prohibitive. We need to design a cost-effective assignment strategy for diskless topics.

JR3. We need to think through the leader election logic with diskless topics. The KIP tries to reuse the ISR logic for classic topics, but it doesn't seem very natural.
JR3.1 In a classic topic, the leader is always in the ISR. In a diskless topic, the KIP says that a leader could be out of sync.
JR3.2 The existing leader election logic based on ISR/ELR mainly tries to preserve previously acknowledged data. With diskless topics, since the object store provides durability, this logic seems no longer needed.
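The ~8X read-amplification figure in JR2 follows directly from the email's assumptions (100 readers per WAL file, a GET priced at 1/12 of a PUT); a Python sketch:

```python
# Read amplification when 100 brokers each re-fetch one 100-partition WAL file.
PUT_COST = 0.005 / 1000   # assumed price per PUT request
GET_COST = PUT_COST / 12  # the email's ratio: a GET costs ~1/12 of a PUT
READERS = 100             # brokers that each read the WAL file once

extra_read_cost = READERS * GET_COST
increase = extra_read_cost / PUT_COST  # GET spend relative to the single PUT
print(f"~{increase:.1f}x additional S3 cost from re-reading the WAL file")
```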
The existing min.isr and unclean leader election logic also don't apply.

JR4. "Despite that there is no inter-broker replication, replicas will still issue FetchRequest to leaders. Leaders will respond with empty (no records) FetchResponse."
This seems unnatural. Could we avoid issuing inter-broker fetch requests for diskless topics?

JR5. "The replica reassignment will follow the same flow as in classic topic:"
JR5.1 Is this true? Since the inter-broker fetch response is always empty, it doesn't seem the current reassignment flow works for diskless topics. Also, since the source of the data is the object store, it seems more natural for a replica to backfill the data from the object store, instead of other replicas. This will also incur lower costs.
JR5.2 How do we prevent reassignment on diskless topics from causing the same cost issue described in JR2?

JR6. "In other functional aspects, diskless topics are indistinguishable from classic topics. This includes durability guarantees, ordering guarantees, transactional and non-transactional producer API, consumer API, consumer groups, share groups, data retention (deletion & compact),"
JR6.1 Could you describe how compacted diskless topics are supported?
JR6.2 Neither this KIP nor KIP-1164 describes the transactional support in detail.

JR7. "Object Storage: A shared, durable, concurrent, and eventually consistent storage supporting arbitrary sized byte values and a minimal set of atomic operations: put, delete, list, and ranged get."
It seems that the object storage in all three major public clouds is strongly consistent.

Jun

On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]> wrote:

Hi all,

The parent KIP-1150 was voted for and accepted. Let's now focus on the technical details presented in this KIP-1163 and also in KIP-1164: Diskless Coordinator [1].

Best,
Ivan

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator

On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:

Hi all!

We want to start the discussion thread for KIP-1163: Diskless Core [1], which is a sub-KIP for KIP-1150 [2].

Let's use the main KIP-1150 discuss thread [3] for high-level questions, motivation, and general direction of the feature, and this thread for particular details of implementation.

Best,
Ivan

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
[3] https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
