Jun & Satish, We can build the merging step to optimize WAL segments for more predictable rebuild times. But could we still perform a final move to Tiered Storage after each partition reaches its configured roll time? We could expect the same load/sizing expectations as for classic topics (e.g. >1GB segments).
We are interested in unifying with Tiered Storage for many reasons, but also so that topics which have diskless mode dynamically enabled/disabled can eventually converge to a predictable state. Thanks, Greg On Wed, May 13, 2026, 3:56 AM Satish Duggana <[email protected]> wrote: > RLMM was not designed for aggressive copying of the latest data to > tiered storage by having small segment rollouts. > > +1 to Jun on leaving the existing RLMM for classic topics with tiered > storage and having an efficient metadata management system required > for diskless topics. > > > On Tue, 12 May 2026 at 23:59, Jun Rao via dev <[email protected]> > wrote: > > > > Hi, Victor, > > > > Thanks for the reply. > > > > JR1. (A) and (B) Yes, your summary matches my thinking. > > (C) "Generally I think that (i) (ii) (iii) and (iv) may be addressed with > > an aggressive tiered storage consolidation (the first approach)". > > Hmm, I am confused by the above statement. By "the first approach", do > you > > mean aggressive tiering with faster segment rolling through the existing > > RLMM? I don't think the existing RLMM is designed to solve these issues > due > > to inefficiencies in cost, metadata propagation and metadata storage as > we > > previously discussed. > > > > JR11. I was thinking we leave the existing RLMM as is and continue to use > > it for classic topics. We design a new, more efficient metadata > management > > component independent of RLMM. This new component will be the only > metadata > > component that diskless topics depend on. > > > > Jun > > > > On Tue, May 12, 2026 at 8:43 AM Viktor Somogyi-Vass <[email protected]> > > wrote: > > > > > Hi Jun, > > > > > > JR1 > > > (1)-(2)-(3) I'd address these together and let me explain our current > idea > > > to solve the tiny object problem because I'm not sure if we're 100% > talking > > > about the same thing. 
I have two approaches in mind for TS > consolidation > > > ((A) and (B)) and I'm not sure if we're both assuming the same idea, so > > > let's clarify this. > > > > > > (A) > > > This is our current assumption. This uses local disks (create classic > > > local logs with UnifiedLog) to consolidate logs into the classic log > format > > > and use RSM and RLMM to store them in tiered storage. This way we're > not > > > limited by the need to have short rollovers. Local logs become a form > of > > > staging environment to serve reads and accumulate records for tiered > > > storage. This means that: > > > (a) Once a message is consolidated into the classic log format, we can > > > use it for serving lagging consumers. Diskless reads should really be > used > > > for the head of the log and after a few seconds logs should be > consolidated. > > > (b) The real cost is much closer to that 87.5% (and in fact my google > > > sheet I shared also assumes this model) because we have more freedom in > > > choosing the retention parameters of the classic log. > > > (c) Metadata is smaller as we only need to keep diskless segments > until > > > the tiered offset surpasses the individual batches' offset. > > > (d) RLMM metadata is also somewhat manageable due to the larger > segment > > > sizes but it's still possible to run into the metadata explosion > problem. > > > (e) It needs to rebuild this local log on reassignment to serve > lagging > > > consumers effectively, so reassignment is a bit more messy. > > > (f) It's not optimal when partitions have a single replica: on > failure we > > > can only fall back to diskless mode until the partition is reassigned > to a > > > functioning broker. > > > > > > (B) > > > Compared to the above there can be an alternative approach, which is to > > > consolidate when diskless segments expire (after 15 minutes for > instance). 
> > > In that case your points seem to fit better as:
> > > (a) we can only use the classic, consolidated logs to serve lagging consumers after they have been tiered
> > > (b) to be more efficient with lagging consumers we have to stick to a short rollover
> > > (c) it's more costly due to the short rollovers
> > > (d) the RLMM bottleneck still exists due to the short rollovers
> > > (e) it's not a given whether we use local disks for transforming logs, as we can do it in memory too (which can be inefficient and more expensive), but perhaps the “chunked transfer encoding” that S3 supports, or something similar from other providers, is a cost-effective way. If we know the final size in advance, we can upload data in chunks and still get billed for 1 put.
> > > (f) reassignment or failover is cleaner and faster as there isn't a need to rebuild local caches.
> > >
> > > (C)
> > > Apart from the first 2 approaches there is a 3rd, which is WAL merging.
To understand your points, let me summarize what I could gather so far as reasons for WAL merging (and please correct me if I missed something):
> > > (i) protecting consumer lag: small WAL files create inefficient objects for lagging consumers, so larger objects should be more efficient
> > > (ii) avoiding the RLMM replay bottleneck: managing small segments with RLMM is very inefficient (100s of GB of metadata)
> > > (iii) reducing batch metadata overhead: merging WAL files may reduce the metadata we need to store, but it depends on the merge algorithm and how we can compact batch data
> > > (iv) cost effectiveness: retrieving merged WAL files reduces the number of get requests to object storage
> > > (v) architectural redundancy with RLMM: ideally we wouldn't need 2 solutions to 2 somewhat similar problems (tiered storage and diskless)
> > >
> > > Generally I think that (i) (ii) (iii) and (iv) may be addressed with an aggressive tiered storage consolidation (the first approach), so the only remaining gap would be (v). I also agree that having 2 different solutions for metadata handling isn't ideal and perhaps there is a possibility of improvement here. It should be possible to redesign RLMM to be more similar to the diskless coordinator or to design a common solution.
> > >
> > > JR11
> > > "If we support merging in the diskless coordinator, I wonder how useful RLMM is. It seems simpler to manage all metadata from the object store in a single place."
> > >
> > > Could you please clarify this a little bit? Do you think that we should replace the RLMM with a solution that is more similar to the diskless coordinator or deprecate tiered storage altogether in favor of diskless?
> > > I'm not sure which option you're referring to:
> > > (1) Unify tiered storage and diskless under a single storage layer (and possibly deprecate tiered storage in favor of diskless with merging WAL segments).
> > > (2) Create a smart coordinator instead of RLMM and possibly unify metadata coordination with diskless.
> > > (3) Keep tiered storage and diskless separate with their own solutions for metadata (probably not optimal).
> > >
> > > Thanks,
> > > Viktor
> > >
> > > On Fri, May 1, 2026 at 9:08 PM Jun Rao via dev <[email protected]> wrote:
> > >
> > >> Hi, Viktor and Greg,
> > >>
> > >> Thanks for the reply.
> > >>
> > >> JR1.
> > >> 1) Thanks for verifying the cost estimation. I noticed a bug in my earlier calculation. I estimated the per broker network transfer rate at 2MB/sec. It should be 4MB/sec. If I correct it, the estimated savings are similar to yours.
> > >> The cost for transferring 4MB through the network is 4 * 2 * 10^-5 = $8 * 10^-5.
> > >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings are about 87.5%.
> > >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings are 62.5%.
> > >> Savings are still significantly lower when using RLMM.
> > >>
> > >> "To me it seems like that Greg's previous suggestion for a 15 min rollover may be a bit too much. With 1 hour we can achieve better cost saving and less coordinate metadata being stored."
> > >> This solves the cost issue, but it has other implications (see point 2) below).
> > >>
> > >> 2) "Yes, I think this is to be expected and a lot depends on the implementation. Ideally segments or chunks should be cached to minimize the number of times segments pulled from remote storage."
> > >> In a classic topic, when a consumer lags, its requests are served either from the local cache or from large objects in the object store. With the current design in a diskless topic, lagging consumer requests might be served from tiny 500-byte objects. This will significantly slow down the consumer's catch-up, which is not expected user behavior. Ideally, we don't want those tiny objects to last more than a few minutes, let alone an hour.
> > >>
> > >> 3) "I think if my calculations are correct (and we use a 60 minute window), then metadata generation should be slower, please see the google sheet I linked above. I think given that traffic, the current topic based RLMM should be able to handle it."
> > >> Why is a 60 minute window used? RLMM metadata needs to be retained for the longest retention time among all topics. This means that the retention window can be weeks instead of 1 hour. This means that RLMM might need to replay over 100GB of data during reassignment, which is not what it is designed for.
> > >>
> > >> JR10. "Your example of 100,000 1kb/s partitions is a borderline case, where there are some configurations which are not viable due to scale or cost, and some that are. It would be up to the operator to tune their cluster, by changing diskless.segment.ms, dividing up the cluster, or switching to a more scalable RLMM implementation."
> > >> A broker with 4MB/sec produce throughput can probably be considered high throughput.
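As a quick sanity check, the 87.5% and 62.5% savings quoted in this exchange can be reproduced from the AWS list prices used in the thread (a sketch; the prices and the 4MB transfer size are the thread's assumptions, and the helper name is purely illustrative):

```python
# Sanity check of the JR1 savings arithmetic (thread assumptions:
# $0.02/GB inter-broker transfer, $0.005 per 1000 S3 PUT requests,
# 4MB of replication transfer replaced by 2 or 6 S3 puts).
NETWORK_COST_PER_MB = 2e-5  # $0.02/GB expressed per MB
S3_PUT_COST = 0.5e-5        # $0.005 per 1000 requests, per request

def put_savings(transfer_mb: float, puts: int) -> float:
    """Fractional saving when S3 PUTs replace replication transfer."""
    return 1 - (puts * S3_PUT_COST) / (transfer_mb * NETWORK_COST_PER_MB)

print(f"{put_savings(4, 2):.1%}")  # 87.5%
print(f"{put_savings(4, 6):.1%}")  # 62.5%
```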
Even with 4K partitions per broker, we could still > achieve an > > >> 87.5% cost saving as listed above, if we do the right implementation. > So, > > >> ideally, it would be useful to support that as well. > > >> > > >> JR11. "We had a short conversation with Greg and we came to the > conclusion > > >> that because of the explosiveness of diskless metadata, it may be > worth > > >> revisiting the merging case as it can indeed buy us some more cost > saving > > >> for the added complexity. " > > >> If we support merging in the diskless coordinator, I wonder how useful > > >> RLMM > > >> is. It seems simpler to manage all metadata from the object store in a > > >> single place. > > >> > > >> Jun > > >> > > >> On Mon, Apr 27, 2026 at 4:17 PM Greg Harris <[email protected]> > wrote: > > >> > > >> > Hi Jun, > > >> > > > >> > Thank you for scrutinizing the scalability of the current > > >> > direct-to-tiered-storage strategy, and its metadata scalability. > > >> > > > >> > One of our implicit assumptions with this design was that users are > able > > >> > to choose between the Diskless and Classic mechanisms, and that any > > >> > situations where the Diskless design was deficient, the Classic > topics > > >> > could continue to be used. > > >> > This was originally applied to low-latency use-cases, but now also > > >> applies > > >> > to low-throughput use-cases too. When the throughput on a topic is > low, > > >> the > > >> > benefit of using Diskless is also low, because it is proportional > to the > > >> > amount of data transferred, and it is more likely that the batch > > >> overhead > > >> > of the topics is significant. > > >> > In other words, we've been treating cost-effective support for > > >> arbitrarily > > >> > low throughput topics as a non-goal. 
> > >> > Your example of 100,000 1kb/s partitions is a borderline case, where there are some configurations which are not viable due to scale or cost, and some that are. It would be up to the operator to tune their cluster, by changing diskless.segment.ms, dividing up the cluster, or switching to a more scalable RLMM implementation.
> > >> >
> > >> > Do you think we should have cost-effective support for arbitrarily low-throughput partitions in Diskless? How much total demand is there in partitions where batches are >1kb but the partition throughput is <1kb/s?
> > >> >
> > >> > Thanks,
> > >> > Greg
> > >> >
> > >> > On Fri, Apr 24, 2026 at 10:23 AM Viktor Somogyi-Vass <[email protected]> wrote:
> > >> >
> > >> >> Hi Jun,
> > >> >>
> > >> >> Regarding JR1.
> > >> >> We had a short conversation with Greg and we came to the conclusion that because of the explosiveness of diskless metadata, it may be worth revisiting the merging case as it can indeed buy us some more cost saving for the added complexity. Also, it would support smaller topics and we could somewhat manage the tiered storage consolidation costs. I think that we would still need to consolidate WAL segments into tiered storage. Reasons are: to limit WAL metadata, to be able to dynamically enable/disable diskless and to be compatible with existing and future TS improvements.
> > >> >> I'll try to refresh KIP-1165 and build it into the calculator above (if it's possible at all :) ) and come back to you.
> > >> >> Regardless, I just wanted to give a short update in the meantime, looking forward to your answer.
> > >> >>
> > >> >> Best,
> > >> >> Viktor
> > >> >>
> > >> >> On Fri, Apr 24, 2026 at 3:46 PM Viktor Somogyi-Vass <[email protected]> wrote:
> > >> >>
> > >> >> > Hi Jun,
> > >> >> >
> > >> >> > Thanks for the quick reply.
> > >> >> >
> > >> >> > JR1.
> > >> >> > 1) Thanks for putting the numbers together. While your calculation seems to be correct in the sense that 6 PUTs would worsen the cost saving benefits, I think that in a byte for byte comparison there is a bigger difference. The reason is that the 4 tiered storage puts transfer much more data compared to the small WAL segments, so in practice there should be fewer TS puts.
> > >> >> > I made a google sheet calculator for this which I'd like to share with you:
> > >> >> > https://docs.google.com/spreadsheets/d/127GOTWfFSN27B5ezif14GPj8KtrghjBqsXG9GG6NxhI/edit?gid=749470906#gid=749470906
> > >> >> > Please copy the sheet to modify the values.
> > >> >> > About my findings: I was trying to create a similar cluster model > > >> that > > >> >> has > > >> >> > been discussed here previously to see how cost varies over > different > > >> >> > segment rollovers.To me it seems like that Greg's previous > suggestion > > >> >> for a > > >> >> > 15 min rollover may be a bit too much. With 1 hour we can achieve > > >> better > > >> >> > cost saving and less coordinate metadata being stored. I have > also > > >> >> tried to > > >> >> > account for the producer batch metadata generated by diskless > > >> partitions > > >> >> > but to me it seems like a lower number than Greg's original > numbers. > > >> >> > > > >> >> > 2) "Note that local storage could be lost on reassigned > partitions. > > >> In > > >> >> > that case, lagging reads can only be served from the object > store." > > >> >> > Yes, I think this is to be expected and a lot depends on the > > >> >> > implementation. Ideally segments or chunks should be cached to > > >> minimize > > >> >> the > > >> >> > number of times segments pulled from remote storage. > > >> >> > > > >> >> > "The 2MB/sec I quoted is for a specific broker. Depending on the > > >> broker > > >> >> > instance type, a broker may only be able to handle low 10s of > MB/sec > > >> of > > >> >> > data. So, 2MB/sec overhead is significant." > > >> >> > Yes, I have indeed misunderstood, however I have updated my > > >> calculator > > >> >> > sheet with metadata calculation. Overall, the number of tiered > > >> storage > > >> >> > segments created seems to be much lower than in your calculations > > >> given > > >> >> the > > >> >> > parameters of the cluster you specified earlier. Please take a > look, > > >> I'd > > >> >> > like to really understand the thinking here because this is a > crucial > > >> >> point. 
> > >> >> > > > >> >> > 3) I think if my calculations are correct (and we use a 60 minute > > >> >> window), > > >> >> > then metadata generation should be slower, please see the google > > >> sheet I > > >> >> > linked above. I think given that traffic, the current topic based > > >> RLMM > > >> >> > should be able to handle it. > > >> >> > In the case where we would need to make the RLMM capable of > handling > > >> a > > >> >> > similar traffic as the diskless coordinator, then you're right, > we > > >> >> probably > > >> >> > should consider how we can improve it. I think there are multiple > > >> >> > possibilities as you mentioned, but ideally there should be a > common > > >> >> > implementation for metadata coordination that could handle these > > >> cases. > > >> >> > > > >> >> > JR7. > > >> >> > Yes, your expectation is totally reasonable, we should expect > the get > > >> >> and > > >> >> > put operations to be strongly consistent for the read-after-write > > >> >> > scenarios. And I think that since major cloud providers give > strongly > > >> >> > consistent object storages, it should be sufficient for a wide > > >> >> user-group. > > >> >> > So we could shrink the scope of the KIP a bit this way and avoid > > >> adding > > >> >> > complexity that is needed mostly on the margin. > > >> >> > I can expect though that "list" can stay eventually consistent > as the > > >> >> KIP > > >> >> > relies on it for only garbage collection where it is fine if a > few > > >> >> segments > > >> >> > can be collected only in the next iteration. > > >> >> > > > >> >> > JR3. > > >> >> > Since Greg hasn't replied yet, I'll try to catch up with him and > > >> >> formulate > > >> >> > an answer next week. > > >> >> > > > >> >> > Best, > > >> >> > Viktor > > >> >> > > > >> >> > On Tue, Apr 21, 2026 at 8:16 PM Jun Rao via dev < > > >> [email protected]> > > >> >> > wrote: > > >> >> > > > >> >> >> Hi, Victor, > > >> >> >> > > >> >> >> Thanks for the reply. 
> > >> >> >> > > >> >> >> JR1. > > >> >> >> 1) "So while it seems to be significant that we tripled the > number > > >> of > > >> >> >> PUTs, cost-wise it doesn't seem to be significant." > > >> >> >> Let's compare the savings achieved by replacing network > replication > > >> >> >> transfer with S3 puts in AWS. > > >> >> >> network transfer cost: $0.02/GB = $2 * 10^-5/MB > > >> >> >> S3 put cost: $0.005 per 1000 requests = $0.5 * 10^-5/request > > >> >> >> > > >> >> >> The KIP batches data up to 4MB. So, let's assume that we write > 2MB > > >> S3 > > >> >> >> objects on average. > > >> >> >> > > >> >> >> The cost for transferring 2MB through the network is 2 * 2 * > 10^-5 = > > >> >> $4* > > >> >> >> 10^-5 > > >> >> >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The > savings > > >> >> are > > >> >> >> about 75%. > > >> >> >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The > savings > > >> >> are > > >> >> >> 25%. As you can see, the savings are significantly lower. > > >> >> >> > > >> >> >> 2) "Therefore we could expect classic local segments to be > present > > >> >> which > > >> >> >> could be used for catching up consumers." > > >> >> >> Note that local storage could be lost on reassigned partitions. > In > > >> that > > >> >> >> case, lagging reads can only be served from the object store. > > >> >> >> > > >> >> >> "Regarding the amount of metadata: 2MB/sec is well below the > 2GB/s > > >> >> >> throughput that Greg calculated previously, so I think it > should be > > >> >> >> manageable for a cluster with that amount of throughput," > > >> >> >> It seems that you didn't make the correct comparison. 2GB/s that > > >> Greg > > >> >> >> mentioned is the throughput for the whole cluster. The 2MB/sec I > > >> >> quoted is > > >> >> >> for a specific broker. Depending on the broker instance type, a > > >> broker > > >> >> may > > >> >> >> only be able to handle low 10s of MB/sec of data. 
So, 2MB/sec > > >> overhead > > >> >> is > > >> >> >> significant. > > >> >> >> > > >> >> >> 3) "I'd separate it from the discussion of diskless core and > > >> perhaps we > > >> >> >> could address it in a separate KIP as it is mostly a redesign > of the > > >> >> >> RLMM." > > >> >> >> Those problems don't exist in the existing usage of RLMM. They > > >> manifest > > >> >> >> because diskless tries to use RLMM in a way it wasn't designed > for > > >> >> (there > > >> >> >> is at least a 20X increase in metadata). It would be useful to > > >> consider > > >> >> >> whether fixing those problems in RLMM or using a new approach is > > >> >> >> better. For example, KIP-1164 already introduces a snapshotting > > >> >> mechanism. > > >> >> >> Adding another snapshotting mechanism to RLMM seems redundant. > > >> >> >> > > >> >> >> JR7. A typical object store supports 3 operations: puts, gets > and > > >> >> lists. > > >> >> >> Which operations used by diskless can be eventually consistent? > I'd > > >> >> expect > > >> >> >> that get should always see the result of the latest put. > > >> >> >> > > >> >> >> Jun > > >> >> >> > > >> >> >> On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass < > > >> [email protected] > > >> >> > > > >> >> >> wrote: > > >> >> >> > > >> >> >> > Hi Jun, > > >> >> >> > > > >> >> >> > I'd like to add my thoughts too until Greg has time to > respond. > > >> >> >> > > > >> >> >> > JR1. I also think there are shortcomings in the current tiered > > >> >> storage > > >> >> >> > design, around the RLMM. > > >> >> >> > 1) I think this is a correct observation, however if my > > >> calculations > > >> >> are > > >> >> >> > correct, it actually comes down to a negligible amount of > cost. 
> > >> >> > Taking the AWS pricing sheet at
> > >> >> > https://aws.amazon.com/s3/pricing/?nc2=h_pr_s3&trk=aebc39a1-139c-43bb-8354-211ac811b83a&sc_channel=ps
> > >> >> > it seems like the difference between 6 or 2 PUTs per second is ~$52 for a month. The calculation follows as: 6*60*60*24*30*0.005/1000-2*60*60*24*30*0.005/1000=$51.84. So while it seems to be significant that we tripled the number of PUTs, cost-wise it doesn't seem to be significant.
> > >> >> > 2) Reflecting on your original problem: the tiered storage consolidation process should be continuously running and transforming WAL segments into classic logs. Therefore we could expect classic local segments to be present which could be used for catching up consumers. So they would only switch to WAL reading when they're close to the end of the log. Since this offset space should be cached, the reads from there should be fast.
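The ~$52/month figure above can be checked directly (a sketch; the $0.005-per-1000-PUTs price and the 30-day month are the thread's assumptions):

```python
# Monthly S3 PUT cost difference between 6 PUTs/sec and 2 PUTs/sec,
# at $0.005 per 1000 requests (AWS price quoted in the thread).
PUT_COST = 0.005 / 1000            # dollars per PUT request
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

def monthly_put_cost(puts_per_sec: float) -> float:
    """Dollars per 30-day month for a sustained PUT rate."""
    return puts_per_sec * SECONDS_PER_MONTH * PUT_COST

delta = monthly_put_cost(6) - monthly_put_cost(2)
print(f"${delta:.2f}")  # $51.84
```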
> > >> >> >> > Regarding the amount of metadata: 2MB/sec is well below the > 2GB/s > > >> >> >> > throughput that Greg calculated previously, so I think it > should > > >> be > > >> >> >> > manageable for a cluster with that amount of throughput, > although > > >> I > > >> >> >> agree > > >> >> >> > with your comment that the current topic based tiered metadata > > >> >> manager > > >> >> >> > isn't optimal and we could develop a better solution. > > >> >> >> > 3) Tied to the previous point, I agree that your comments are > > >> >> absolutely > > >> >> >> > valid, however similarly to that, I'd separate it from the > > >> >> discussion of > > >> >> >> > diskless core and perhaps we could address it in a separate > KIP as > > >> >> it is > > >> >> >> > mostly a redesign of the RLMM. > > >> >> >> > > > >> >> >> > JR2. Ack. We will raise a KIP in the near future. > > >> >> >> > > > >> >> >> > JR3. I'd leave answering this to Greg as I don't have too much > > >> >> context > > >> >> >> on > > >> >> >> > this one. > > >> >> >> > > > >> >> >> > JR7. I think this could be similar to the tiered storage > design, > > >> so > > >> >> any > > >> >> >> > coordinator operation should be strongly consistent (since > we're > > >> >> using > > >> >> >> > classic topics there). Therefore the WAL segment storage layer > > >> could > > >> >> be > > >> >> >> > eventually consistent as we store its metadata in a strongly > > >> >> consistent > > >> >> >> > manner. I'm not sure though if this was the answer you're > looking > > >> >> for? > > >> >> >> > > > >> >> >> > Best, > > >> >> >> > Viktor > > >> >> >> > > > >> >> >> > > > >> >> >> > > > >> >> >> > On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev < > > >> >> [email protected]> > > >> >> >> > wrote: > > >> >> >> > > > >> >> >> >> Hi, Greg, > > >> >> >> >> > > >> >> >> >> Thanks for the reply. > > >> >> >> >> > > >> >> >> >> JR1. 
Rolling log segments every 15 minutes addresses the 3 > > >> concerns > > >> >> I > > >> >> >> >> listed, but it introduces some new issues because it doesn't > > >> quite > > >> >> fit > > >> >> >> the > > >> >> >> >> design of the current tiered storage. (a) The current tiered > > >> storage > > >> >> >> >> design > > >> >> >> >> stores a single partition per object. If we roll a log > segment > > >> >> every 15 > > >> >> >> >> minutes, with 4K partitions per broker, this means an > additional > > >> 4 > > >> >> S3 > > >> >> >> puts > > >> >> >> >> per second. The diskless design aims for 2 S3 puts per > second. > > >> So, > > >> >> this > > >> >> >> >> triples the S3 put cost and reduces the savings benefits. (b) > > >> With > > >> >> Tier > > >> >> >> >> storage, each broker essentially needs to read the tier > metadata > > >> >> from > > >> >> >> all > > >> >> >> >> tier metadata partitions if the number of user partitions > exceeds > > >> >> 50. > > >> >> >> >> Assuming that we generate 100 bytes of tier metadata per > > >> partition > > >> >> >> every > > >> >> >> >> 15 > > >> >> >> >> minutes. Assuming that each broker has 4K partitions and a > > >> cluster > > >> >> of > > >> >> >> 500 > > >> >> >> >> brokers. Each broker needs to receive tier metadata at a > rate of > > >> >> 100 * > > >> >> >> 4K > > >> >> >> >> * > > >> >> >> >> 500 / (15 * 60) = 200KB/Sec. For a broker hosting one of the > 50 > > >> tier > > >> >> >> >> metadata topic partitions, it needs to send out metadata at > 100 * > > >> >> 4K * > > >> >> >> 500 > > >> >> >> >> / 50 * 500 / (15 * 60) = 2MB/Sec. This increases unnecessary > > >> network > > >> >> >> and > > >> >> >> >> CPU overhead. (c) Tier storage doesn't support snapshots. A > > >> >> restarted > > >> >> >> >> broker needs to replay the tier metadata log from the > beginning > > >> to > > >> >> >> build > > >> >> >> >> the tier metadata state. 
Suppose that the tier metadata log > is > > >> kept > > >> >> >> for 7 > > >> >> >> >> days. The total amount of tier metadata that needs to be > > >> replayed is > > >> >> >> 200KB > > >> >> >> >> * 7 * 24 * 3600 = 120GB. > > >> >> >> >> Does the merging optimization you mentioned address those new > > >> >> >> concerns? If > > >> >> >> >> so, could you describe how it works? > > >> >> >> >> > > >> >> >> >> JR2. It's fine to cover the default partition assignment > strategy > > >> >> for > > >> >> >> >> diskless topics in a separate KIP. However, since this is > > >> essential > > >> >> for > > >> >> >> >> achieving the cost saving goal, we need a solution before > > >> releasing > > >> >> the > > >> >> >> >> diskless KIP. > > >> >> >> >> > > >> >> >> >> JR3. Sounds good. Could you document how this work? > > >> >> >> >> > > >> >> >> >> JR7. Could you describe which parts of the operation can be > > >> >> eventually > > >> >> >> >> consistent? > > >> >> >> >> > > >> >> >> >> Jun > > >> >> >> >> > > >> >> >> >> On Thu, Mar 19, 2026 at 1:35 PM Greg Harris < > > >> [email protected]> > > >> >> >> wrote: > > >> >> >> >> > > >> >> >> >> > Hi Jun, > > >> >> >> >> > > > >> >> >> >> > Thanks for your comments! > > >> >> >> >> > > > >> >> >> >> > JR1: > > >> >> >> >> > You are correct that the segment rolling configurations are > > >> >> currently > > >> >> >> >> > critical to balance the scalability of Diskless and Tiered > > >> >> Storage, > > >> >> >> as > > >> >> >> >> > larger roll configurations benefit tiered storage, and > smaller > > >> >> roll > > >> >> >> >> > configurations benefit Diskless. > > >> >> >> >> > > > >> >> >> >> > To address your points specifically: > > >> >> >> >> > (1) A Diskless topic which is cost-competitive with an > > >> equivalent > > >> >> >> >> Classic > > >> >> >> >> > topic will have a metadata size <1% of the data size. 
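Jun's JR1 rate and replay estimates above can be re-derived from the stated assumptions (a sketch; all inputs come from the thread, and because the thread rounds 222KB/sec down to 200KB/sec, the replay figure here comes out slightly above the quoted 120GB):

```python
# Re-deriving the RLMM tier-metadata estimates (thread assumptions:
# 100 bytes of tier metadata per partition per 15-minute roll,
# 4K partitions per broker, 500 brokers, 50 tier metadata partitions).
BYTES_PER_PARTITION = 100
ROLL_SEC = 15 * 60
PARTITIONS_PER_BROKER = 4_000
BROKERS = 500
METADATA_PARTITIONS = 50
RETENTION_SEC = 7 * 24 * 3600  # 7-day tier metadata retention

# Every broker consumes the tier metadata of all partitions in the cluster.
receive_bps = BYTES_PER_PARTITION * PARTITIONS_PER_BROKER * BROKERS / ROLL_SEC
# A broker hosting one of the 50 metadata partitions fans 1/50 of that
# stream out to all 500 brokers.
send_bps = receive_bps / METADATA_PARTITIONS * BROKERS
# A restarted broker replays the entire retained metadata log.
replay_bytes = receive_bps * RETENTION_SEC

print(f"{receive_bps / 1e3:.0f} KB/sec")  # ~222 KB/sec (thread rounds to ~200)
print(f"{send_bps / 1e6:.1f} MB/sec")     # ~2.2 MB/sec
print(f"{replay_bytes / 1e9:.0f} GB")     # ~134 GB (thread: ~120GB, from 200KB/sec)
```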
A > cluster > > >> >> >> storing > > >> >> >> >> > 360GB of metadata will have >36TB of data under management > and > > >> a > > >> >> >> >> retention > > >> >> >> >> > of 5hr implies a throughput of >2GB/s. This will require > > >> multiple > > >> >> >> >> Diskless > > >> >> >> >> > coordinators, which can share the load of storing the > Diskless > > >> >> >> metadata, > > >> >> >> >> > and serving Diskless requests. > > >> >> >> >> > (2) Catching up consumers are intended to be served from > tiered > > >> >> >> storage > > >> >> >> >> > and local segment caches. Brokers which are building their > > >> local > > >> >> >> segment > > >> >> >> >> > caches will have to read many files, but will amortize > those > > >> >> reads by > > >> >> >> >> > receiving data for multiple partitions in a single read. > > >> >> >> >> > (3) This is a fundamental downside of storing data from > > >> multiple > > >> >> >> topics > > >> >> >> >> in > > >> >> >> >> > a single object, similar to classic segments. We can > implement > > >> a > > >> >> >> >> > configurable cluster-wide maximum roll time, which would > set > > >> the > > >> >> >> slowest > > >> >> >> >> > cadence at which Tiered Storage segments are rolled from > > >> Diskless > > >> >> >> >> segments. > > >> >> >> >> > If an individual partition has more aggressive roll > settings, > > >> it > > >> >> may > > >> >> >> be > > >> >> >> >> > rolled earlier. > > >> >> >> >> > This configuration would permit the cluster operator to > > >> >> approximately > > >> >> >> >> > bound the number of diskless WAL segments, which bounds the > > >> total > > >> >> >> size > > >> >> >> >> of > > >> >> >> >> > the WAL segments, disk cache, diskless coordinator state, > and > > >> >> >> excessive > > >> >> >> >> > retention window. 
> > > > > For example, a diskless.segment.ms of 15 minutes would reduce
> > > > > the metadata storage to 18GB, WAL segments to 1.8TB, and permit
> > > > > short-retention data to be physically deleted as soon as ~15
> > > > > minutes after being produced.
> > > > > Of course, this will reduce the size of the tiered storage
> > > > > segments for topics that have low throughput, and where
> > > > > segment.ms > diskless.segment.ms, increasing overhead in the RLMM.
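The 15-minute example works out as follows: bounding the roll time to 15 minutes keeps only 15/300 of the 5-hour working set live, scaling down the 360GB/36TB baseline figures from the sizing discussion above.

```python
# Illustrative check of the 15-minute diskless.segment.ms example.
GB, TB = 1_000_000_000, 1_000_000_000_000

roll_fraction = 15 / (5 * 60)        # 15 min out of a 5 h roll window
metadata = 360 * GB * roll_fraction  # down from 360GB of coordinator state
wal = 36 * TB * roll_fraction        # down from 36TB of live WAL segments
print(metadata / GB)                 # 18.0 GB
print(wal / TB)                      # 1.8 TB
```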
> > > > > We can perform merging/optimization of Tiered Storage segments
> > > > > to achieve the per-topic segment.ms.
> > > > > There were some reasons why we retracted the prior file-merging
> > > > > approach, and why merging in tiered storage appears better:
> > > > > * Rewriting files requires mutability for existing data, which
> > > > > adds complexity. Diskless batches or Remote Log Segments would
> > > > > need to be made mutable, and the remote log will be made mutable
> > > > > in KIP-1272 [1]
> > > > > * Because a WAL Segment can contain batches from multiple
> > > > > Diskless Coordinators, multiple coordinators must also be
> > > > > involved in the merging step. The Tiered Storage design has
> > > > > exclusive ownership for remote log segments within the RLMM.
> > > > > * Diskless file merging competes for resources with
> > > > > latency-sensitive producers and hot consumers. Tiered storage
> > > > > file merging competes for resources with lagging consumers, which
> > > > > are typically less latency sensitive.
> > > > > * Implementing merging in Tiered Storage allows this optimization
> > > > > to benefit both classic topics and diskless topics, covering both
> > > > > high and low throughput partitions.
> > > > > * Remote log segments may be optimized over much longer time
> > > > > windows rather than performing optimization once in the first few
> > > > > hours of the life of a WAL segment and then freezing the
> > > > > arrangement of the data until it is deleted.
> > > > > * File merging will need to rely on heuristics, which should be
> > > > > configurable by the user. Multi-partition heuristics are more
> > > > > complicated to describe and reason about than single-partition
> > > > > heuristics.
> > > > > What do you think of this alternative?
> > > > >
> > > > > JR2:
> > > > > Yes, the current default partition assignment strategy will need
> > > > > some improvement. This problem with Diskless WAL segments is
> > > > > analogous to the Classic topics' dense inter-broker connection
> > > > > graph.
> > > > > The natural solution to this seems to be some sort of cellular
> > > > > design, where the replica placements tend to locate partitions in
> > > > > similar groups. Partitions in the same cell can generally share
> > > > > the same WAL Segments and the same Diskless Coordinator requests.
> > > > > This would also benefit Classic topics, which would need fewer
> > > > > connections and fetch requests.
> > > > > Such a feature is out-of-scope of this KIP, and either we will
> > > > > publish a follow-up KIP, or let operators and community tooling
> > > > > address this.
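The cellular-placement intuition sketched here can be illustrated numerically: each distinct broker hosting one of a WAL object's partitions must fetch that object once, so confining replicas to a small cell bounds the read fan-out. The cell size of 6 brokers below is an arbitrary illustration, not a value proposed in the thread.

```python
# Hypothetical illustration of how cellular replica placement bounds the
# number of times a single WAL object is read from object storage.

def readers_per_wal_file(partitions_per_file: int, candidate_brokers: int) -> int:
    """Each distinct broker hosting one of the file's partitions reads it once."""
    return min(partitions_per_file, candidate_brokers)

# Replicas spread across the whole cluster: a 100-partition WAL file may be
# read 100 times, once per hosting broker.
print(readers_per_wal_file(100, candidate_brokers=100))  # 100
# Replicas confined to a 6-broker cell: the same file is read at most 6 times.
print(readers_per_wal_file(100, candidate_brokers=6))    # 6
```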
> > > > > JR3:
> > > > > Yes we will replace the ISR/ELR election logic for diskless
> > > > > topics, as they no longer rely on replicas for data integrity. We
> > > > > will fully model the state/lifecycle of the diskless replicas in
> > > > > KRaft, and choose how we display this to clients.
> > > > > For backwards compatibility, clients using older metadata
> > > > > requests should see diskless topics, but interpret them as
> > > > > classic topics. We could tell older clients that the leader is in
> > > > > the ISR, even if it just started building its cache.
> > > > > For clients using the latest metadata, they should see the true
> > > > > state of the diskless partition: which nodes can accept
> > > > > produce/fetch/sharefetch requests, which ranges of offsets are
> > > > > cached on-broker, etc. This could also be used to break apart the
> > > > > "leader" field into more granular fields, now that leadership has
> > > > > changed meaning.
> > > > >
> > > > > JR4:
> > > > > Yes, we can replace the empty fetch requests to the leader nodes
> > > > > with cache hint fields in the requests to the Diskless
> > > > > Coordinator, and rely on the coordinator to distribute cache
> > > > > hints to all replicas. This should be low-overhead, and eliminate
> > > > > the inter-broker communication for brokers which only host
> > > > > Diskless topics.
> > > > >
> > > > > JR5.1:
> > > > > You are correct and this text was ambiguous, only specifying that
> > > > > the controller waits for the sync to be complete. This section is
> > > > > now updated to explicitly say that local segments are built from
> > > > > object storage.
> > > > >
> > > > > JR5.2:
> > > > > Extending the JR2 discussion, reassignment of diskless topics
> > > > > would generally happen within a cell, where the marginal cost of
> > > > > reading an additional partition is very low. When cells are
> > > > > re-balanced and a partition is migrated between cells, there is a
> > > > > brief time (until the next Tiered Storage segment roll) when the
> > > > > marginal cost is doubled. This should be infrequent and
> > > > > well-amortized by other topics which aren't being re-balanced
> > > > > between cells.
> > > > >
> > > > > JR6.1:
> > > > > We plan to move data from Diskless to Tiered Storage. Once the
> > > > > data is in Tiered Storage, it can be compacted using the
> > > > > functionality described in KIP-1272 [1]
> > > > >
> > > > > JR6.2:
> > > > > We will add details for this soon.
> > > > >
> > > > > JR7:
> > > > > We specify the requirement of eventual consistency to allow
> > > > > Diskless Topics to be used with other object storage
> > > > > implementations which aren't the three major public clouds, such
> > > > > as self-managed software or weaker consistency caches.
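The minimal object-storage contract discussed here (put, delete, list, and ranged get, quoted from the KIP later in this thread) can be sketched as a small interface; the Python names and signatures below are illustrative, not part of the KIP, and the in-memory stand-in is just enough to exercise the contract.

```python
# A sketch of the minimal object-storage contract the KIP describes:
# put, delete, list, and ranged get. Names/signatures are illustrative.
from typing import Iterable, Protocol


class ObjectStorage(Protocol):
    def put(self, key: str, value: bytes) -> None: ...
    def delete(self, key: str) -> None: ...
    def list(self, prefix: str) -> Iterable[str]: ...
    def ranged_get(self, key: str, start: int, end: int) -> bytes: ...


class InMemoryStorage:
    """In-memory stand-in satisfying the ObjectStorage protocol."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._objects[key] = value

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def list(self, prefix: str) -> Iterable[str]:
        return [k for k in sorted(self._objects) if k.startswith(prefix)]

    def ranged_get(self, key: str, start: int, end: int) -> bytes:
        return self._objects[key][start:end]


store = InMemoryStorage()
store.put("wal/0001", b"batch-a|batch-b")
print(store.ranged_get("wal/0001", 0, 7))  # b'batch-a'
```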
> > > > > Thanks,
> > > > > Greg
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage
> > > > >
> > > > > On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <[email protected]> wrote:
> > > > >
> > > > > > Hi, Ivan,
> > > > > >
> > > > > > Thanks for the KIP. A few comments below.
> > > > > >
> > > > > > JR1. I am concerned about the usage of the current tiered
> > > > > > storage to control the number of small WAL files. Current
> > > > > > tiered storage only tiers the data when a segment rolls, which
> > > > > > can take hours. This causes three problems. (1) Much more
> > > > > > metadata needs to be stored and maintained, which increases the
> > > > > > cost. Suppose that each segment rolls every 5 hours, each
> > > > > > partition generates 2 WAL files per second and each WAL file's
> > > > > > metadata takes 100 bytes. Each partition will generate 5 * 3.6K
> > > > > > * 2 * 100 = 3.6MB of metadata. In a cluster with 100K
> > > > > > partitions, this translates to 360GB of metadata stored on the
> > > > > > diskless coordinators. (2) A catching-up consumer's performance
> > > > > > degrades since it's forced to read data from many small WAL
> > > > > > files. (3) The data in WAL files could be retained much longer
> > > > > > than retention time. Since the small WAL files aren't
> > > > > > completely deleted until all partitions' data in them are
> > > > > > obsolete, the deletion of the WAL files could be delayed by
> > > > > > hours or more. If the WAL file includes a partition with a low
> > > > > > retention time, the retention contract could be violated
> > > > > > significantly. The earlier design of the KIP included a
> > > > > > separate object merging process that combines small WAL files
> > > > > > much more aggressively than tiered storage, which seems to be a
> > > > > > much better choice.
> > > > > >
> > > > > > JR2. I don't think the current default partition assignment
> > > > > > strategy for classic topics works for diskless topics. The
> > > > > > current strategy tries to spread the replicas to as many
> > > > > > brokers as possible. For example, if a broker has 100
> > > > > > partitions, their replicas could be spread over 100 brokers. If
> > > > > > the broker generates a WAL file with 100 partitions, this WAL
> > > > > > file will be read 100 times, once by each broker. S3 read cost
> > > > > > is 1/12 of the cost of S3 put.
> > > > > > This assignment strategy will increase the S3 cost by about
> > > > > > 8X, which is prohibitive. We need to design a cost-effective
> > > > > > assignment strategy for diskless topics.
> > > > > >
> > > > > > JR3. We need to think through the leader election logic with
> > > > > > diskless topics. The KIP tries to reuse the ISR logic for
> > > > > > classic topics, but it doesn't seem very natural.
> > > > > > JR3.1 In classic topics, the leader is always in the ISR. In
> > > > > > diskless topics, the KIP says that a leader could be out of
> > > > > > sync.
> > > > > > JR3.2 The existing leader election logic based on ISR/ELR
> > > > > > mainly tries to preserve previously acknowledged data. With
> > > > > > diskless topics, since the object store provides durability,
> > > > > > this logic seems no longer needed. The existing min.isr and
> > > > > > unclean leader election logic also don't apply.
> > > > > >
> > > > > > JR4. "Despite that there is no inter-broker replication,
> > > > > > replicas will still issue FetchRequest to leaders. Leaders
> > > > > > will respond with empty (no records) FetchResponse."
> > > > > > This seems unnatural. Could we avoid issuing inter-broker
> > > > > > fetch requests for diskless topics?
> > > > > >
> > > > > > JR5. "The replica reassignment will follow the same flow as in
> > > > > > classic topic:".
> > > > > > JR5.1 Is this true? Since the inter-broker fetch response is
> > > > > > always empty, it doesn't seem the current reassignment flow
> > > > > > works for diskless topics. Also, since the source of the data
> > > > > > is the object store, it seems more natural for a replica to
> > > > > > backfill the data from the object store, instead of other
> > > > > > replicas. This will also incur lower costs.
> > > > > > JR5.2 How do we prevent reassignment on diskless topics from
> > > > > > causing the same cost issue described in JR2?
> > > > > >
> > > > > > JR6. "In other functional aspects, diskless topics are
> > > > > > indistinguishable from classic topics. This includes
> > > > > > durability guarantees, ordering guarantees, transactional and
> > > > > > non-transactional producer API, consumer API, consumer groups,
> > > > > > share groups, data retention (deletion & compact),"
> > > > > > JR6.1 Could you describe how compacted diskless topics are
> > > > > > supported?
> > > > > > JR6.2 Neither this KIP nor KIP-1164 describes the
> > > > > > transactional support in detail.
> > > > > >
> > > > > > JR7. "Object Storage: A shared, durable, concurrent, and
> > > > > > eventually consistent storage supporting arbitrary sized byte
> > > > > > values and a minimal set of atomic operations: put, delete,
> > > > > > list, and ranged get."
> > > > > > It seems that the object storage in all three major public
> > > > > > clouds is strongly consistent.
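The arithmetic behind JR1 and JR2 above can be verified directly; every input (5-hour rolls, 2 WAL files per second, 100-byte metadata entries, 100K partitions, replicas spread over 100 brokers, and the 1/12 read-to-put cost ratio) is an assumption stated in the message.

```python
# Sanity check of the JR1 and JR2 figures quoted above.
MB, GB = 1_000_000, 1_000_000_000

# JR1: 5h roll window, 2 WAL files/s per partition, 100 bytes of metadata each.
per_partition = 5 * 3600 * 2 * 100
print(per_partition / MB)             # 3.6 MB of metadata per partition
print(per_partition * 100_000 / GB)   # 360.0 GB across 100K partitions

# JR2: each WAL file is written once (1 put) but read once per hosting
# broker. With replicas spread over 100 brokers and read cost = put / 12,
# the reads alone cost ~100/12 times the put.
read_cost_multiple = 100 / 12
print(round(read_cost_multiple, 1))   # 8.3, i.e. "about 8X" extra S3 cost
```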
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > The parent KIP-1150 was voted for and accepted. Let's now
> > > > > > > focus on the technical details presented in this KIP-1163 and
> > > > > > > also in KIP-1164: Diskless Coordinator [1].
> > > > > > >
> > > > > > > Best,
> > > > > > > Ivan
> > > > > > >
> > > > > > > [1]
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
> > > > > > >
> > > > > > > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
> > > > > > > > Hi all!
> > > > > > > >
> > > > > > > > We want to start the discussion thread for KIP-1163:
> > > > > > > > Diskless Core [1], which is a sub-KIP for KIP-1150 [2].
> > > > > > > >
> > > > > > > > Let's use the main KIP-1150 discuss thread [3] for
> > > > > > > > high-level questions, motivation, and general direction of
> > > > > > > > the feature and this thread for particular details of
> > > > > > > > implementation.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Ivan
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> > > > > > > > [2]
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > > > > > > [3]
> > > > > > > > https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
