Hi all,

Thank you for the extensive feedback.
We have substantially updated the KIP to address the points raised. Given the scope of these changes, we have started a new VOTE thread to restart the voting process cleanly. You can find the new thread here: https://lists.apache.org/thread/m42nj2qm5z4w2x8kt7x4kgghfzrdwl7q Best, Anatolii On Mon, Dec 15, 2025 at 9:47 PM Thomas Thornton via dev < [email protected]> wrote: > Hi all, > > We (the team at Slack) have been following the recent discussion regarding > the scope and timeline of KIP-1150. We agree with the community sentiment > that Diskless Topics represents the right long-term architecture for Kafka > in the cloud, but we also recognize the valid concerns raised regarding the > engineering resources required to deliver such an ambitious change. > > To help address these concerns and accelerate the timeline, we are happy to > announce that we are partnering with the KIP-1150 authors to co-develop > this feature. > > We previously proposed KIP-1176: Tiered Storage for Active Log Segments to > solve similar problems. However, rather than fragmenting the community's > efforts across competing designs, we have decided to withdraw KIP-1176 and > consolidate our engineering resources behind KIP-1150. > > To start, we plan to take ownership of Compaction for Tiered Storage. This > has long been a missing feature in KIP-405, and it becomes critical in a > Diskless architecture where long-lived data is obligated to be tiered. By > driving this related prerequisite feature, we hope to allow for faster > delivery of KIP-1150. > > We are excited to collaborate on this to ensure a robust and timely > delivery for the community. > > Best, > Tom & Henry > > On Fri, Nov 14, 2025 at 8:35 AM Luke Chen <[email protected]> wrote: > > > Hi Greg, > > > > Thanks for sharing the meeting notes. > > I agree we should keep polishing the contents of 1150 & high level design > > in 1163 to prepare for a vote. > > > > > > Thanks. > > Luke > > > > On Fri, Nov 14, 2025 at 3:54 AM Greg Harris <[email protected] > > > > wrote: > > > > > Hi all, > > > > > > There was a video call between myself, Ivan Yurchenko, Jun Rao, and > > Andrew > > > Schofield pertaining to KIP-1150. Here are the notes from that meeting: > > > > > > Ivan: What is the future state of Kafka in this area, in 5 years? > > > Jun: Do we want something more cloud native? Yes, started with Tiered > > > Storage. If there’s a better way, we should explore it. In the long > term > > > this will be useful > > > Because Kafka is used so widely, we need to make sure everything we add > > is > > > for the long term and for everyone, not just for a single company. > > > When we add TS, it doesn’t just solve Uber’s use-case. We want > something > > > that’s high quality/lasts/maintainable, and can work with all existing > > > capabilities. > > > If both 1150 and 1176 proceed at the same time, it’s confusing. They > > > overlap, but Diskless is more ambitious. > > > If both KIPs are being seriously worked on, then we don’t really need > > both, > > > because Diskless clearly is better. Having multiple will confuse > people. > > It > > > will duplicate some of the effort. > > > If we want diskless ultimately, what is the short term strategy, to get > > > some early wins first? > > > Ivan: Andrew, do you want a more revolutionary approach? > > > Andrew: Eventually the architecture will change substantially, it may > not > > > be necessary to put all of that bill onto Diskless at once. 
> > > Greg: We all agree on having a high quality feature merged upstream, > and > > > supporting all APIs > > > Jun: We should try and keep things simple, but there is some minimum > > > complexity needed. > > > When doing the short term changes (1176), it doesn’t really progress in > > > changing to a more modern architecture. > > > Greg: Was TS+Compaction the only feature miss we’ve had so far? > > > Jun: The danger of only applying changes to some part of the API, you > set > > > the precedence that you only have to implement part of the API. > > Supporting > > > the full API set should be a minimum requirement. > > > Andrew: When we started Kraft, how much did we know the design? > > > Jun: For Kraft we didn’t really know much about the migration, but the > > > high-level was clear. > > > Greg: Is 1150 votable in its current state? > > > Jun: 1150 should promise to support all APIs. It doesn’t have to have > all > > > the details/apis/etc. KIP-500 didn’t have it. > > > We do need some high-level design enough to give confidence that the > > > promise is able to be fulfilled. > > > Greg: Is the draft version in 1163 enough detail or is more needed? > > > Jun: We need to agree on the core design, such as leaderless etc. And > how > > > the APIs will be supported. > > > Greg: Okay we can include these things, and provide a sketch of how the > > > other leader-based features operate. > > > Jun: Yeah if at a high level the sketch appears to work, we can approve > > > that functionality. > > > Are you committed to doing the more involved and big project? > > > Greg: Yes, we’re committed to the 1163 design and can’t really accept > > 1176. > > > Jun: TS was slow because of Uber resourcing problems > > > Greg: We’ll push internally for resources, and use the community > > sentiment > > > to motivate Aiven. > > > How far into the future should we look? What sort of scale? > > > Jun: As long as there’s a path forward, and we’re not closing off > future > > > improvements, we can figure out how to handle a larger scale when it > > > arises. > > > Greg: Random replica placement is very harmful, can we recommend users > to > > > use an external tool like CruiseControl? > > > Jun: Not everyone uses CruiseControl, we would probably need some > > solution > > > for this out of the box > > > Ivan: Should the Batch Coordinator be pluggable? > > > Jun: Out-of-box experience should be good, good to allow other > > > implementations > > > Greg: But it could hurt Kafka feature/upgrade velocity when we wait for > > > plugin providers to implement it > > > Ivan: We imagined that maybe cloud hyperscalers could implement it with > > > e.g. dynamodb > > > Greg: Could we bake more details of the different providers into Kafka, > > or > > > does it still make sense for it to be pluggable? > > > Jun: Make it whatever is easiest to roll out and add new clients > > > Andrew: What happens next? Do you want to get KIP-1150 voted? > > > Ivan: The vote is already open, we’re not too pressed for time. We’ll > go > > > improve the 1163 design and communication. > > > Is 1176 a competing design? Someone will ask. > > > Jun: If we are seriously working on something more ambitious, yeah we > > > shouldn’t do the stop-gap solution. > > > It’s diverting review resources. If we can get the short term thing in > > 1yr > > > but Diskless solution is 2y it makes sense to go for Diskless. If it’s > > 5yr, > > > that’s different and maybe the stop-gap solution is needed. 
> > > Greg: I’m biased but I believe we’re in the 1yr/2yr case. Should we > > > explicitly exclude 1176? > > > Andrew: Put your arms around the feature set you actually want, and use > > > that to rule out 1176. > > > Probably don’t need -1 votes, most likely KIPs just don’t receive > votes. > > > Ivan: Should we have sync meetings like tiered storage did? > > > Jun: Satish posted meeting notes regularly, we should do the same. > > > > > > To summarize, we will be polishing the contents of 1150 & high level > > design > > > in 1163 to prepare for a vote. > > > We believe that the community should select the feature set of 1150 to > > > fully eliminate producer cross-zone costs, and make the investment in a > > > high quality Diskless Topics implementation rather than in stop-gap > > > solutions. > > > > > > Thanks, > > > Greg > > > > > > On Fri, Nov 7, 2025 at 9:19 PM Max fortun <[email protected]> wrote: > > > > > > > This may be a tangent, but we needed to offload storage off of Kafka > > into > > > > S3. We are keeping Kafka not as a source of truth, but as a mostly > > > > ephemeral broker that can come and go as it pleases. Be that scaling > or > > > > outage. Disks can be destroyed and recreated at will, we still retain > > > data > > > > and use broker for just that, brokering messages. Not only that, we > > > reduced > > > > the requirement on the actual Kafka resources by reducing the size > of a > > > > payload via a claim check pattern. Maybe this is an anti–pattern, but > > it > > > is > > > > super fast and highly cost efficient. We reworked ProducerRequest to > > > allow > > > > plugins. We added a custom http plugin that submits every request > via a > > > > persisted connection to a microservice. Microservice stores the > payload > > > and > > > > returns a tiny json metadata object,a claim check, that can be used > to > > > find > > > > the actual data. Think of it as zipping the payload. This claim check > > > > metadata traverses the pipelines with consumers using the urls in > > > metadata > > > > to pull what they need. Think unzipping. This allowed us to also pull > > > ONLY > > > > the data that we need in graphql like manner. So if you have a 100K > > json > > > > payload and you need only a subsection, you can pull that by > jmespath. > > > When > > > > you have multiple consumer groups yanking down huge payloads it is > > > > cumbersome on the broker. When you have the same consumer groups > > yanking > > > > down a claim check, and then going out of band directly to the source > > of > > > > truth, the broker has some breathing room. Obviously our microservice > > > does > > > > not go directly to the cloud storage, as that would be too slow. It > > > stores > > > > the payload in high speed memory cache and returns asap. That memory > is > > > > eventually persisted into S3. The retrieval goest against the cache > > > first, > > > > then against the S3. Overall a rather cheappy and zippy solution. I > > tried > > > > proposing the KIP for this, but there was no excitement. Check this > > out: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=318606528 > > > > > > > > > > > > > On Nov 7, 2025, at 5:49 PM, Jun Rao <[email protected]> > > wrote: > > > > > > > > > > Hi, Andrew, > > > > > > > > > > If we want to focus only on reducing cross-zone replication costs, > > > there > > > > is > > > > > an alternative design in the KIP-1176 discussion thread that seems > > > > simpler > > > > > than the proposal here. 
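(A minimal sketch of the claim-check pattern Max describes above, using Kafka's public ProducerInterceptor API. The interceptor class, the payload-store URL, and the JSON claim-check format are hypothetical; Max's actual plugin hooks into ProducerRequest rather than the interceptor chain, and his service fronts a memory cache that is later persisted to S3.)

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/**
 * Hypothetical claim-check interceptor: stores the real payload out of band
 * and sends only a small JSON "claim check" through Kafka.
 */
public class ClaimCheckInterceptor implements ProducerInterceptor<String, byte[]> {
    private final HttpClient http = HttpClient.newHttpClient();
    private String storeUrl; // e.g. a hypothetical http://payload-store.internal/put

    @Override
    public void configure(Map<String, ?> configs) {
        this.storeUrl = String.valueOf(configs.get("claim.check.store.url"));
    }

    @Override
    public ProducerRecord<String, byte[]> onSend(ProducerRecord<String, byte[]> record) {
        try {
            // Upload the full payload to the out-of-band store...
            HttpRequest put = HttpRequest.newBuilder(URI.create(storeUrl))
                    .POST(HttpRequest.BodyPublishers.ofByteArray(record.value()))
                    .build();
            // ...and the store replies with a tiny JSON claim check, e.g. {"url": "..."}.
            String claimCheck = http.send(put, HttpResponse.BodyHandlers.ofString()).body();
            return new ProducerRecord<>(record.topic(), record.partition(), record.timestamp(),
                    record.key(), claimCheck.getBytes(StandardCharsets.UTF_8), record.headers());
        } catch (Exception e) {
            return record; // fall back to sending the original payload unchanged
        }
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) { }

    @Override
    public void close() { }
}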
I am copying the outline of that approach > > > below. > > > > > > > > > > 1. A new leader is elected. > > > > > 2. Leader maintains a first tiered offset, which is initialized to > > log > > > > end > > > > > offset. > > > > > 3. Leader writes produced data from the client to local log. > > > > > 4. Leader uploads produced data from all local logs as a combined > > > object > > > > > 5. Leader stores the metadata for the combined object in memory. > > > > > 6. If a follower fetch request has an offset >= first tiered > offset, > > > the > > > > > metadata for the corresponding combined object is returned. > > Otherwise, > > > > the > > > > > local data is returned. > > > > > 7. Leader periodically advances first tiered offset. > > > > > > > > > > It's still a bit unnatural, but it could work. > > > > > > > > > > Hi, Ivan, > > > > > > > > > > Are you still committed to proceeding with the original design of > > > > KIP-1150? > > > > > > > > > > Thanks, > > > > > > > > > > Jun > > > > > > > > > > On Sun, Nov 2, 2025 at 6:00 AM Andrew Schofield < > > > > [email protected]> > > > > > wrote: > > > > > > > > > >> Hi, > > > > >> I’ve been following KIP-1150 and friends for a while. I’m going to > > > jump > > > > >> into the discussions too. > > > > >> > > > > >> Looking back at Jack Vanlightly’s message, I am not quite so > > convinced > > > > >> that it’s a kind of fork in the road. The primary aim of the > effort > > is > > > > to > > > > >> reduce cross-zone replication costs so Apache Kafka is not > > > prohibitively > > > > >> expensive to use on cloud storage. I think it would be entirely > > viable > > > > to > > > > >> prioritise code reuse for an initial implementation of diskless > > > topics, > > > > and > > > > >> we could still have a more cloud-native design in the future. It’s > > > hard > > > > to > > > > >> predict what the community will prioritise in the future. > > > > >> > > > > >> Of the three major revisions, I’m in the rev3 camp. We can support > > > > >> leaderless produce requests, first writing WAL segments into > object > > > > >> storage, and then using the regular partition leaders to sequence > > the > > > > >> records. The active log segment for a diskless topic will > initially > > > > contain > > > > >> batch coordinates rather than record batches. The batch > coordinates > > > can > > > > be > > > > >> resolved from WAL segments for consumers, and also in order to > > prepare > > > > log > > > > >> segments for uploading to tiered storage. Jun is probably correct > > that > > > > we > > > > >> need a more frequent object merging process than tiered storage > > > > provides. > > > > >> This is just the transition from write-optimised WAL segments to > > > > >> read-optimised tiered segments, and all of the object > storage-based > > > > >> implementations of Kafka that I’m aware of do this rearrangement. > > But > > > > >> perhaps this more frequent object merging is a pre-GA improvement, > > > > rather > > > > >> than a strict requirement for an initial implementation for early > > > access > > > > >> use. > > > > >> > > > > >> For zone-aligned share consumers, the share group assignor is > > intended > > > > to > > > > >> be rack-aware. Consumers should be assigned to partitions with > > leaders > > > > in > > > > >> their zone. The simple assignor is not rack-aware, but it easily > > could > > > > be > > > > >> or we could have a rack-aware assignor. 
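(A minimal sketch of the fetch-side branching in the seven-step outline Jun copied above, covering steps 2 and 5 through 7. The class, record types, and field names are hypothetical, not existing Kafka code.)

import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

class FirstTieredOffsetSketch {

    /** Metadata for a combined object uploaded to object storage (step 4). */
    record CombinedObjectMetadata(String objectKey, long baseOffset, int sizeInBytes) { }

    /** A follower fetch returns either local record bytes or combined-object metadata. */
    record FetchResult(byte[] localRecords, CombinedObjectMetadata objectMetadata) { }

    // Step 2: initialized to the log end offset when the leader is elected.
    private volatile long firstTieredOffset;

    // Step 5: in-memory metadata for uploaded combined objects, keyed by base offset.
    private final NavigableMap<Long, CombinedObjectMetadata> combinedObjects =
            new ConcurrentSkipListMap<>();

    /** Step 6: fetches at or beyond the first tiered offset get object metadata, not data. */
    FetchResult handleFollowerFetch(long fetchOffset) {
        if (fetchOffset >= firstTieredOffset) {
            var entry = combinedObjects.floorEntry(fetchOffset);
            return new FetchResult(null, entry == null ? null : entry.getValue());
        }
        return new FetchResult(readLocalLog(fetchOffset), null);
    }

    /** Step 7: the leader periodically advances the first tiered offset. */
    void advanceFirstTieredOffset(long newOffset) {
        firstTieredOffset = Math.max(firstTieredOffset, newOffset);
    }

    private byte[] readLocalLog(long offset) {
        return new byte[0]; // placeholder for the existing local read path
    }
}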
> > > > >> > > > > >> Thanks, > > > > >> Andrew > > > > >> > > > > >> > > > > >>> On 24 Oct 2025, at 23:14, Jun Rao <[email protected]> > > wrote: > > > > >>> > > > > >>> Hi, Ivan, > > > > >>> > > > > >>> Thanks for the reply. > > > > >>> > > > > >>> "As I understand, you’re speaking about locally materialized > > > segments. > > > > >> They > > > > >>> will indeed consume some IOPS. See them as a cache that could > > always > > > be > > > > >>> restored from the remote storage. While it’s not ideal, it's > still > > OK > > > > to > > > > >>> lose data in them due to a machine crash, for example. Because of > > > this, > > > > >> we > > > > >>> can avoid explicit flushing on local materialized segments at all > > and > > > > let > > > > >>> the file system and page cache figure out when to flush > optimally. > > > This > > > > >>> would not eliminate the extra IOPS, but should reduce it > > > dramatically, > > > > >>> depending on throughput for each partition. We, of course, > continue > > > > >>> flushing the metadata segments as before." > > > > >>> > > > > >>> If we have a mix of classic and diskless topics on the same > broker, > > > > it's > > > > >>> important that the classic topics' data is flushed to disk as > > quickly > > > > as > > > > >>> possible. To achieve this, users typically set > > dirty_expire_centisecs > > > > in > > > > >>> the kernel based on the number of available disk IOPS. Once you > set > > > > this > > > > >>> number, it applies to all dirty files, including the cached data > in > > > > >>> diskless topics. So, if there are more files actively > accumulating > > > > data, > > > > >>> the flush frequency and therefore RPO is reduced for classic > > topics. > > > > >>> > > > > >>> "We should have mentioned this explicitly, but this step, in > fact, > > > > >> remains > > > > >>> in the form of segments offloading to tiered storage. When we > > > assemble > > > > a > > > > >>> segment and hand it over to RemoteLogManager, we’re effectively > > doing > > > > >>> metadata compaction: replacing a big number of pieces of metadata > > > about > > > > >>> individual batches with a single record in > __remote_log_metadata." > > > > >>> > > > > >>> The object merging in tier storage typically only kicks in after > a > > > few > > > > >>> hours. The impact is (1) the amount of accumulated metadata is > > still > > > > >> quite > > > > >>> large; (2) there are many small objects, leading to poor read > > > > >> performance. > > > > >>> I think we need a more frequent object merging process than tier > > > > storage > > > > >>> provides. > > > > >>> > > > > >>> Jun > > > > >>> > > > > >>> > > > > >>> On Thu, Oct 23, 2025 at 10:12 AM Ivan Yurchenko <[email protected]> > > > > wrote: > > > > >>> > > > > >>>> Hello Jack, Jun, Luke, and all! > > > > >>>> > > > > >>>> Thank you for your messages. > > > > >>>> > > > > >>>> Let me first address some of Jun’s comments. > > > > >>>> > > > > >>>>> First, it degrades the durability. > > > > >>>>> For each partition, now there are two files being actively > > written > > > > at a > > > > >>>>> given point of time, one for the data and another for the > > metadata. > > > > >>>>> Flushing each file requires a separate IO. If the disk has 1K > > IOPS > > > > and > > > > >> we > > > > >>>>> have 5K partitions in a broker, currently we can afford to > flush > > > each > > > > >>>>> partition every 5 seconds, achieving an RPO of 5 seconds. 
If we > > > > double > > > > >>>> the > > > > >>>>> number of files per partition, we can only flush each partition > > > every > > > > >> 10 > > > > >>>>> seconds, which makes RPO twice as bad. > > > > >>>> > > > > >>>> As I understand, you’re speaking about locally materialized > > > segments. > > > > >> They > > > > >>>> will indeed consume some IOPS. See them as a cache that could > > always > > > > be > > > > >>>> restored from the remote storage. While it’s not ideal, it's > still > > > OK > > > > to > > > > >>>> lose data in them due to a machine crash, for example. Because > of > > > > this, > > > > >> we > > > > >>>> can avoid explicit flushing on local materialized segments at > all > > > and > > > > >> let > > > > >>>> the file system and page cache figure out when to flush > optimally. > > > > This > > > > >>>> would not eliminate the extra IOPS, but should reduce it > > > dramatically, > > > > >>>> depending on throughput for each partition. We, of course, > > continue > > > > >>>> flushing the metadata segments as before. > > > > >>>> > > > > >>>> It’s worth making a note on caching. I think nobody will > disagree > > > that > > > > >>>> doing direct reads from remote storage every time a batch is > > > requested > > > > >> by a > > > > >>>> consumer will not be practical neither from the performance nor > > from > > > > the > > > > >>>> economy point of view. We need a way to keep the number of GET > > > > requests > > > > >>>> down. There are multiple options, for example: > > > > >>>> 1. Rack-aware distributed in-memory caching. > > > > >>>> 2. Local in-memory caching. Comes with less network chattiness > and > > > > >> works > > > > >>>> well if we have more or less stable brokers to consume from. > > > > >>>> 3. Materialization of diskless logs on local disk. Way lower > > impact > > > on > > > > >>>> RAM and also requires stable brokers for consumption (using just > > > > >> assigned > > > > >>>> replicas will probably work well). > > > > >>>> > > > > >>>> Materialization is one of possible options, but we can choose > > > another > > > > >> one. > > > > >>>> However, we will have this dilemma regardless of whether we have > > an > > > > >>>> explicit coordinator or we go “coordinator-less”. > > > > >>>> > > > > >>>>> Second, if we ever need this > > > > >>>>> metadata somewhere else, say in the WAL file manager, the > > consumer > > > > >> needs > > > > >>>> to > > > > >>>>> subscribe to every partition in the cluster, which is > > inefficient. > > > > The > > > > >>>>> actual benefit of this approach is also questionable. On the > > > surface, > > > > >> it > > > > >>>>> might seem that we could reduce the number of lines that need > to > > be > > > > >>>> changed > > > > >>>>> for this KIP. However, the changes are quite intrusive to the > > > classic > > > > >>>>> partition's code path and will probably make the code base > harder > > > to > > > > >>>>> maintain in the long run. I like the original approach based on > > the > > > > >> batch > > > > >>>>> coordinator much better than this one. We could probably > refactor > > > the > > > > >>>>> producer state code so that it could be reused in the batch > > > > >> coordinator. > > > > >>>> > > > > >>>> It’s hard to disagree with this. The explicit coordinator is > more > > a > > > > side > > > > >>>> thing, while coordinator-less approach is more about extending > > > > >>>> ReplicaManager, UnifiedLog and others substantially. 
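(As an aside, the RPO arithmetic Jun quotes above is easy to reproduce; the numbers below are the illustrative figures from the message itself, 1K IOPS and 5K partitions, not measurements.)

/** Reproduces the RPO arithmetic quoted above: flush interval = files to flush / disk IOPS. */
class RpoBackOfEnvelope {
    public static void main(String[] args) {
        int diskIops = 1_000;   // flushes the disk can absorb per second
        int partitions = 5_000; // partitions hosted on the broker

        // One actively written file per partition (classic topics only):
        System.out.println("RPO with 1 file/partition:  "
                + (partitions * 1) / diskIops + " s");  // 5 s

        // Two actively written files per partition (data + metadata):
        System.out.println("RPO with 2 files/partition: "
                + (partitions * 2) / diskIops + " s");  // 10 s
    }
}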
> > > > >>>> > > > > >>>>> Thanks for addressing the concerns on the number of RPCs in the > > > > produce > > > > >>>>> path. I agree that with the metadata crafting mechanism, we > could > > > > >>>> mitigate > > > > >>>>> the PRC problem. However, since we now require the metadata to > be > > > > >>>>> collocated with the data on the same set of brokers, it's weird > > > that > > > > >> they > > > > >>>>> are now managed by different mechanisms. The data assignment > now > > > uses > > > > >> the > > > > >>>>> metadata crafting mechanism, but the metadata is stored in the > > > > classic > > > > >>>>> partition using its own assignment strategy. It will be > > complicated > > > > to > > > > >>>> keep > > > > >>>>> them collocated. > > > > >>>> > > > > >>>> I would like to note that the metadata crafting is needed only > to > > > tell > > > > >>>> producers which brokers they should send Produce requests to, > but > > > data > > > > >> (as > > > > >>>> in “locally materialized log”) is located on partition replicas, > > > i.e. > > > > >>>> automatically co-located with metadata. > > > > >>>> > > > > >>>> As a side note, it would probably be better that instead of > > > implicitly > > > > >>>> crafting partition metadata, we extend the metadata protocol so > > that > > > > for > > > > >>>> diskless partitions we return not only the leader and replicas, > > but > > > > also > > > > >>>> some “recommended produce brokers”, selected for optimal > > performance > > > > and > > > > >>>> costs. Producers will pick ones in their racks. > > > > >>>> > > > > >>>>> I am also concerned about the removal of the object > > > > compaction/merging > > > > >>>>> step. > > > > >>>> > > > > >>>> We should have mentioned this explicitly, but this step, in > fact, > > > > >> remains > > > > >>>> in the form of segments offloading to tiered storage. When we > > > > assemble a > > > > >>>> segment and hand it over to RemoteLogManager, we’re effectively > > > doing > > > > >>>> metadata compaction: replacing a big number of pieces of > metadata > > > > about > > > > >>>> individual batches with a single record in > __remote_log_metadata. > > > > >>>> > > > > >>>> We could create a Diskless-specific merging mechanism instead if > > > > needed. > > > > >>>> It’s rather easy with the explicit coordinator approach. With > the > > > > >>>> coordinator-less approach, this would probably be a bit more > > tricky > > > > >>>> (rewriting the tail of the log by the leader + replicating this > > > change > > > > >>>> reliably). > > > > >>>> > > > > >>>>> I see a tendency toward primarily optimizing for the fewest > code > > > > >> changes > > > > >>>> in > > > > >>>>> the KIP. Instead, our primary goal should be a clean design > that > > > can > > > > >> last > > > > >>>>> for the long term. > > > > >>>> > > > > >>>> Yes, totally agree. > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> Luke, > > > > >>>>> I'm wondering if the complexity of designing txn and queue is > > > because > > > > >> of > > > > >>>>> leaderless cluster, do you think it will be simpler if we only > > > focus > > > > on > > > > >>>> the > > > > >>>>> "diskless" design to handle object compaction/merging to/from > the > > > > >> remote > > > > >>>>> storage to save the cross-AZ cost? > > > > >>>> > > > > >>>> After some evolution of the original proposal, leaderless is now > > > > >> limited. 
> > > > >>>> We only need to be able to accept Produce requests on more than > > one > > > > >> broker > > > > >>>> to eliminate the cross-AZ costs for producers. Do I get it right > > > that > > > > >> you > > > > >>>> propose to get rid of this? Or do I misunderstand? > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> Let’s now look at this problem from a higher level, as Jack > > > proposed. > > > > As > > > > >>>> it was said, the big choice we need to make is whether we 1) > > create > > > an > > > > >>>> explicit batch coordinator; or 2) go for the coordinator-less > > > > approach, > > > > >>>> where each diskless partition is managed by its leader as in > > classic > > > > >> topics. > > > > >>>> > > > > >>>> If we try to compare the two approaches: > > > > >>>> > > > > >>>> Pluggability: > > > > >>>> - Explicit coordinator: Possible. For example, some setups may > > > benefit > > > > >>>> from batch metadata being stored in a cloud database (such as > AWS > > > > >> DynamoDB > > > > >>>> or GCP Spanner). > > > > >>>> - Coordinator-less: Impossible. > > > > >>>> > > > > >>>> Scalability and fault tolerance: > > > > >>>> - Explicit coordinator: Depends on the implementation and it’s > > also > > > > >>>> necessary to actively work for it. > > > > >>>> - Coordinator-less: Closer to classic Kafka topics. Scaling is > > done > > > by > > > > >>>> partition placement, partitions could fail independently. > > > > >>>> > > > > >>>> Separation of concerns: > > > > >>>> - Explicit coordinator: Very good. Diskless remains more > > independent > > > > >> from > > > > >>>> classic topics in terms of code and workflows. For example, the > > > > >>>> above-mentioned non-tiered storage metadata compaction mechanism > > > could > > > > >> be > > > > >>>> relatively simply implemented with it. As a flip side of this, > > some > > > > >>>> workflows (e.g. transactions) will have to be adapted. > > > > >>>> - Coordinator-less: Less so. It leans to the opposite: bringing > > > > diskless > > > > >>>> closer to classic topics. Some code paths and workflows could be > > > more > > > > >>>> straightforwardly reused, but they will inevitably have to be > > > adapted > > > > to > > > > >>>> accommodate both topic types as also discussed. > > > > >>>> > > > > >>>> Cloud-nativeness. This is a vague concept, also related to the > > > > previous, > > > > >>>> but let’s try: > > > > >>>> - Explicit coordinator: Storing and processing metadata > separately > > > > makes > > > > >>>> it easier for brokers to take different roles, be purely > stateless > > > if > > > > >>>> needed, etc. > > > > >>>> - Coordinator-less: Less so. Something could be achieved with > > > creative > > > > >>>> partition placement, but not much. > > > > >>>> > > > > >>>> Both seem to have their pros and cons. However, answering Jack’s > > > > >> question, > > > > >>>> the explicit coordinator approach may indeed lead to a more > > flexible > > > > >> design. > > > > >>>> > > > > >>>> > > > > >>>> The purpose of this deviation in the discussion was to receive a > > > > >>>> preliminary community evaluation of the coordinator-less > approach > > > > >> without > > > > >>>> taking on the task of writing a separate KIP and fitting it in > the > > > > >> system > > > > >>>> of KIP-1150 and its children. We’re open to stopping it and > > getting > > > > >> back to > > > > >>>> working out the coordinator design if the community doesn’t > favor > > > the > > > > >>>> proposed approach. 
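(To make the pluggability comparison above concrete, here is one hypothetical shape a pluggable coordinator could take. None of these names exist in Kafka or in the KIP; a DynamoDB- or Spanner-backed coordinator would simply be another implementation behind an interface like this.)

import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical batch-coordinator plug-in point, sketched only to illustrate the
 * "explicit coordinator" option discussed above. Names and signatures are invented.
 */
public interface BatchCoordinator extends AutoCloseable {

    /** A batch as written into a WAL object, before offsets are assigned. */
    record UncommittedBatch(String topic, int partition, long byteOffset,
                            int byteLength, int recordCount) { }

    /** Coordinates of one committed batch inside a shared WAL object. */
    record BatchCoordinates(String topic, int partition, long baseOffset,
                            String walObjectKey, long byteOffset, int byteLength) { }

    /** Commit the batches of one WAL object; the coordinator assigns their offsets. */
    CompletableFuture<List<BatchCoordinates>> commitWalObject(String walObjectKey,
                                                              List<UncommittedBatch> batches);

    /** Resolve which WAL object bytes are needed to serve a fetch at the given offset. */
    CompletableFuture<List<BatchCoordinates>> findBatches(String topic, int partition,
                                                          long fetchOffset, int maxBytes);

    /** List WAL objects no longer referenced by any partition, so they can be deleted. */
    CompletableFuture<List<String>> listUnreferencedWalObjects();
}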
> > > > >>>> > > > > >>>> Best, > > > > >>>> Ivan and Diskless team > > > > >>>> > > > > >>>> On Mon, Oct 20, 2025, at 05:58, Luke Chen wrote: > > > > >>>>> Hi Ivan, > > > > >>>>> > > > > >>>>> As Jun pointed out, the updated design seems to have some > > > > shortcomings > > > > >>>>> although it simplifies the implementation. > > > > >>>>> > > > > >>>>> I'm wondering if the complexity of designing txn and queue is > > > because > > > > >> of > > > > >>>>> leaderless cluster, do you think it will be simpler if we only > > > focus > > > > on > > > > >>>> the > > > > >>>>> "diskless" design to handle object compaction/merging to/from > the > > > > >> remote > > > > >>>>> storage to save the cross-AZ cost? > > > > >>>>> > > > > >>>>> > > > > >>>>> Thank you, > > > > >>>>> Luke > > > > >>>>> > > > > >>>>> On Sat, Oct 18, 2025 at 5:22 AM Jun Rao > <[email protected] > > > > > > > >>>> wrote: > > > > >>>>> > > > > >>>>>> Hi, Ivan, > > > > >>>>>> > > > > >>>>>> Thanks for the explanation. > > > > >>>>>> > > > > >>>>>> "we write the reference to the WAL file with the batch data" > > > > >>>>>> > > > > >>>>>> I understand the approach now, but I think it is a hacky one. > > > There > > > > >> are > > > > >>>>>> multiple short comings with this design. First, it degrades > the > > > > >>>> durability. > > > > >>>>>> For each partition, now there are two files being actively > > written > > > > at > > > > >> a > > > > >>>>>> given point of time, one for the data and another for the > > > metadata. > > > > >>>>>> Flushing each file requires a separate IO. If the disk has 1K > > IOPS > > > > and > > > > >>>> we > > > > >>>>>> have 5K partitions in a broker, currently we can afford to > flush > > > > each > > > > >>>>>> partition every 5 seconds, achieving an RPO of 5 seconds. If > we > > > > double > > > > >>>> the > > > > >>>>>> number of files per partition, we can only flush each > partition > > > > every > > > > >>>> 10 > > > > >>>>>> seconds, which makes RPO twice as bad. Second, if we ever need > > > this > > > > >>>>>> metadata somewhere else, say in the WAL file manager, the > > consumer > > > > >>>> needs to > > > > >>>>>> subscribe to every partition in the cluster, which is > > inefficient. > > > > The > > > > >>>>>> actual benefit of this approach is also questionable. On the > > > > surface, > > > > >>>> it > > > > >>>>>> might seem that we could reduce the number of lines that need > to > > > be > > > > >>>> changed > > > > >>>>>> for this KIP. However, the changes are quite intrusive to the > > > > classic > > > > >>>>>> partition's code path and will probably make the code base > > harder > > > to > > > > >>>>>> maintain in the long run. I like the original approach based > on > > > the > > > > >>>> batch > > > > >>>>>> coordinator much better than this one. We could probably > > refactor > > > > the > > > > >>>>>> producer state code so that it could be reused in the batch > > > > >>>> coordinator. > > > > >>>>>> > > > > >>>>>> Thanks for addressing the concerns on the number of RPCs in > the > > > > >> produce > > > > >>>>>> path. I agree that with the metadata crafting mechanism, we > > could > > > > >>>> mitigate > > > > >>>>>> the PRC problem. However, since we now require the metadata to > > be > > > > >>>>>> collocated with the data on the same set of brokers, it's > weird > > > that > > > > >>>> they > > > > >>>>>> are now managed by different mechanisms. 
The data assignment > now > > > > uses > > > > >>>> the > > > > >>>>>> metadata crafting mechanism, but the metadata is stored in the > > > > classic > > > > >>>>>> partition using its own assignment strategy. It will be > > > complicated > > > > to > > > > >>>> keep > > > > >>>>>> them collocated. > > > > >>>>>> > > > > >>>>>> I am also concerned about the removal of the object > > > > compaction/merging > > > > >>>>>> step. My first concern is on the amount of metadata that need > to > > > be > > > > >>>> kept. > > > > >>>>>> Without object compcation, the metadata generated in the > produce > > > > path > > > > >>>> can > > > > >>>>>> only be deleted after remote tiering kicks in. Let's say for > > every > > > > >>>> 250ms we > > > > >>>>>> produce 100 byte of metadata per partition. Let's say remoting > > > > tiering > > > > >>>>>> kicks in after 5 hours. In a cluster with 100K partitions, we > > need > > > > to > > > > >>>> keep > > > > >>>>>> about 100 * (1 / 0.2) * 5 * 3600 * 100K = 720 GB metadata, > > quite > > > > >>>>>> signficant. A second concern is on performance. Every time we > > need > > > > to > > > > >>>>>> rebuild the caching data, we need to read a bunch of small > > objects > > > > >>>> from S3, > > > > >>>>>> slowing down the building process. If a consumer happens to > need > > > > such > > > > >>>> data, > > > > >>>>>> it could slow down the application. > > > > >>>>>> > > > > >>>>>> I see a tendency toward primarily optimizing for the fewest > code > > > > >>>> changes in > > > > >>>>>> the KIP. Instead, our primary goal should be a clean design > that > > > can > > > > >>>> last > > > > >>>>>> for the long term. > > > > >>>>>> > > > > >>>>>> Thanks, > > > > >>>>>> > > > > >>>>>> Jun > > > > >>>>>> > > > > >>>>>> On Tue, Oct 14, 2025 at 11:02 AM Ivan Yurchenko < > [email protected] > > > > > > > >>>> wrote: > > > > >>>>>> > > > > >>>>>>> Hi Jun, > > > > >>>>>>> > > > > >>>>>>> Thank you for your message. I’m sorry that I failed to > clearly > > > > >>>> explain > > > > >>>>>> the > > > > >>>>>>> idea. Let me try to fix this. > > > > >>>>>>> > > > > >>>>>>>> Does each partition now have a metadata partition and a > > separate > > > > >>>> data > > > > >>>>>>>> partition? If so, I am concerned that it essentially doubles > > the > > > > >>>> number > > > > >>>>>>> of > > > > >>>>>>>> partitions, which impacts the number of open file > descriptors > > > and > > > > >>>> the > > > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a > > > > separate > > > > >>>>>>>> partition just to store the metadata. It's as if we are > > creating > > > > an > > > > >>>>>>>> internal topic with an unbounded number of partitions. > > > > >>>>>>> > > > > >>>>>>> No. There will be only one physical partition per diskless > > > > >>>> partition. Let > > > > >>>>>>> me explain this with an example. Let’s say we have a diskless > > > > >>>> partition > > > > >>>>>>> topic-0. It has three replicas 0, 1, 2; 0 is the leader. We > > > produce > > > > >>>> some > > > > >>>>>>> batches to this partition. 
The content of the segment file > will > > > be > > > > >>>>>>> something like this (for each batch): > > > > >>>>>>> > > > > >>>>>>> BaseOffset: 00000000000000000000 (like in classic) > > > > >>>>>>> Length: 123456 (like in classic) > > > > >>>>>>> PartitionLeaderEpoch: like in classic > > > > >>>>>>> Magic: like in classic > > > > >>>>>>> CRC: like in classic > > > > >>>>>>> Attributes: like in classic > > > > >>>>>>> LastOffsetDelta: like in classic > > > > >>>>>>> BaseTimestamp: like in classic > > > > >>>>>>> MaxTimestamp: like in classic > > > > >>>>>>> ProducerId: like in classic > > > > >>>>>>> ProducerEpoch: like in classic > > > > >>>>>>> BaseSequence: like in classic > > > > >>>>>>> RecordsCount: like in classic > > > > >>>>>>> Records: > > > > >>>>>>> path/to/wal/files/5b55c4bb-f52a-4204-aea6-81226895158a; byte > > > offset > > > > >>>>>>> 123456 > > > > >>>>>>> > > > > >>>>>>> It looks very much like classic log entries. The only > > difference > > > is > > > > >>>> that > > > > >>>>>>> instead of writing real Records, we write the reference to > the > > > WAL > > > > >>>> file > > > > >>>>>>> with the batch data (I guess we need only the name and the > byte > > > > >>>> offset, > > > > >>>>>>> because the byte length is the standard field above). > > Otherwise, > > > > >>>> it’s a > > > > >>>>>>> normal Kafka log with the leader and replicas. > > > > >>>>>>> > > > > >>>>>>> So we have as many partitions for diskless as for classic. As > > of > > > > open > > > > >>>>>> file > > > > >>>>>>> descriptors, let’s proceed to the following: > > > > >>>>>>> > > > > >>>>>>>> Are the metadata and > > > > >>>>>>>> the data for the same partition always collocated on the > same > > > > >>>> broker? > > > > >>>>>> If > > > > >>>>>>>> so, how do we enforce that when replicas are reassigned? > > > > >>>>>>> > > > > >>>>>>> The source of truth for the data is still in WAL files on > > object > > > > >>>> storage. > > > > >>>>>>> The source of truth for the metadata is in segment files on > the > > > > >>>> brokers > > > > >>>>>> in > > > > >>>>>>> the replica set. Two new mechanisms are planned, both > > independent > > > > of > > > > >>>> this > > > > >>>>>>> new proposal, but I want to present them to give the idea > that > > > > only a > > > > >>>>>>> limited amount of data files will be operated locally: > > > > >>>>>>> - We want to assemble batches into segment files and offload > > them > > > > to > > > > >>>>>>> tiered storage in order to prevent the unbounded growth of > > batch > > > > >>>>>> metadata. > > > > >>>>>>> For this, we need to open only a few file descriptors (for > the > > > > >>>> segment > > > > >>>>>>> file itself + the necessary indexes) before the segment is > > fully > > > > >>>> written > > > > >>>>>>> and handed over to RemoteLogManager. > > > > >>>>>>> - We want to assemble local segment files for caching > purposes > > as > > > > >>>> well, > > > > >>>>>>> i.e. to speed up fetching. This will not materialize the full > > > > >>>> content of > > > > >>>>>>> the log, but only the hot set according to some policy (or > > > > >>>> configurable > > > > >>>>>>> policies), i.e. the number of segments and file descriptors > > will > > > > >>>> also be > > > > >>>>>>> limited. > > > > >>>>>>> > > > > >>>>>>>> The number of RPCs in the produce path is significantly > > higher. 
> > > > For > > > > >>>>>>>> example, if a produce request has 100 partitions, in a > cluster > > > > >>>> with 100 > > > > >>>>>>>> brokers, each produce request could generate 100 more RPC > > > > requests. > > > > >>>>>> This > > > > >>>>>>>> will significantly increase the request rate. > > > > >>>>>>> > > > > >>>>>>> This is a valid concern that we considered, but this issue > can > > be > > > > >>>>>>> mitigated. I’ll try to explain the approach. > > > > >>>>>>> The situation with a single broker is trivial: all the commit > > > > >>>> requests go > > > > >>>>>>> from the broker to itself. > > > > >>>>>>> Let’s scale this to a multi-broker cluster, but located in > the > > > > single > > > > >>>>>> rack > > > > >>>>>>> (AZ). Any broker can accept Produce requests for diskless > > > > >>>> partitions, but > > > > >>>>>>> we can tell producers (through metadata) to always send > Produce > > > > >>>> requests > > > > >>>>>> to > > > > >>>>>>> leaders. For example, broker 0 hosts the leader replicas for > > > > diskless > > > > >>>>>>> partitions t1-0, t2-1, t3-0. It will receive diskless Produce > > > > >>>> requests > > > > >>>>>> for > > > > >>>>>>> these partitions in various combinations, but only for them. > > > > >>>>>>> > > > > >>>>>>> Broker 0 > > > > >>>>>>> +-----------------+ > > > > >>>>>>> | t1-0 | > > > > >>>>>>> | t2-1 <--------------------+ > > > > >>>>>>> | t3-0 | | > > > > >>>>>>> produce | +-------------+ | | > > > > >>>>>>> requests | | diskless | | | > > > > >>>>>>> --------------->| produce +--------------+ > > > > >>>>>>> for these | | WAL buffer | | commit requests > > > > >>>>>>> partitions | +-------------+ | for these partitions > > > > >>>>>>> | | > > > > >>>>>>> +-----------------+ > > > > >>>>>>> > > > > >>>>>>> The same applies for other brokers in this cluster. > > Effectively, > > > > each > > > > >>>>>>> broker will commit only to itself, which effectively means 1 > > > commit > > > > >>>>>> request > > > > >>>>>>> per WAL buffer (this may be 0 physical network calls, if we > > wish, > > > > >>>> just a > > > > >>>>>>> local function call). > > > > >>>>>>> > > > > >>>>>>> Now let’s scale this to multiple racks (AZs). Obviously, we > > > cannot > > > > >>>> always > > > > >>>>>>> send Produce requests to the designated leaders of diskless > > > > >>>> partitions: > > > > >>>>>>> this would mean inter-AZ network traffic, which we would like > > to > > > > >>>> avoid. > > > > >>>>>> To > > > > >>>>>>> avoid it, we say that every broker has a “diskless produce > > > > >>>>>> representative” > > > > >>>>>>> in every AZ. If we continue our example: when a Produce > request > > > for > > > > >>>> t1-0, > > > > >>>>>>> t2-1, or t3-0 comes from a producer in AZ 0, it lands on > > broker 0 > > > > >>>> (in the > > > > >>>>>>> broker’s AZ the representative is the broker itself). > However, > > if > > > > it > > > > >>>>>> comes > > > > >>>>>>> from AZ 1, it lands on broker 1; in AZ 2, it’s broker 2. 
> > > > >>>>>>> > > > > >>>>>>> |produce requests |produce requests |produce > > > > >>>> requests > > > > >>>>>>> |for t1-0, t2-1, t3-0 |for t1-0, t2-1, t3-0 |for t1-0, > > > t2-1, > > > > >>>>>> t3-0 > > > > >>>>>>> |from AZ 0 |from AZ 1 |from AZ 2 > > > > >>>>>>> v v v > > > > >>>>>>> Broker 0 (AZ 0) Broker 1 (AZ 1) Broker 2 (AZ 2) > > > > >>>>>>> +---------------+ +---------------+ > +---------------+ > > > > >>>>>>> | t1-0 | | | | > | > > > > >>>>>>> | t2-1 | | | | > | > > > > >>>>>>> | t3-0 | | | | > | > > > > >>>>>>> +---------------+ +--------+------+ > +--------+------+ > > > > >>>>>>> ^ ^ | | > > > > >>>>>>> | +--------------------+ | > > > > >>>>>>> | commit requests for these partitions | > > > > >>>>>>> | | > > > > >>>>>>> +-------------------------------------------------+ > > > > >>>>>>> commit requests for these partitions > > > > >>>>>>> > > > > >>>>>>> All the partitions that broker 0 is the leader of will be > > > > >>>> “represented” > > > > >>>>>> by > > > > >>>>>>> brokers 1 and 2 in their AZs. > > > > >>>>>>> > > > > >>>>>>> Of course, this relationship goes both ways between AZs (not > > > > >>>> necessarily > > > > >>>>>>> between the same brokers). It means that provided the cluster > > is > > > > >>>> balanced > > > > >>>>>>> by the number of brokers per AZ, each broker will represent > > > > >>>>>> (number_of_azs > > > > >>>>>>> - 1) other brokers. This will result in the situation that > for > > > the > > > > >>>>>> majority > > > > >>>>>>> of commits, each broker will do up to (number_of_azs - 1) > > network > > > > >>>> commit > > > > >>>>>>> requests (plus one local). Cloud regions tend to have 3 AZs, > > very > > > > >>>> rarely > > > > >>>>>>> more. That means, brokers will be doing up to 2 network > commit > > > > >>>> requests > > > > >>>>>> per > > > > >>>>>>> WAL file. > > > > >>>>>>> > > > > >>>>>>> There are the following exceptions: > > > > >>>>>>> 1. Broker count imbalance between AZs. For example, when we > > have > > > 2 > > > > >>>> AZs > > > > >>>>>> and > > > > >>>>>>> one has three brokers and another AZ has one. This one broker > > > will > > > > do > > > > >>>>>>> between 1 and 3 commit requests per WAL file. This is not an > > > > extreme > > > > >>>>>>> amplification. Such an imbalance is not healthy in most > > practical > > > > >>>> setups > > > > >>>>>>> and should be avoided anyway. > > > > >>>>>>> 2. Leadership changes and metadata propagation period. When > the > > > > >>>> partition > > > > >>>>>>> t3-0 is relocated from broker 0 to some broker 3, the > producers > > > > will > > > > >>>> not > > > > >>>>>>> know this immediately (unless we want to be strict and > respond > > > with > > > > >>>>>>> NOT_LEADER_OR_FOLLOWER). So if t1-0, t2-1, and t3-0 will come > > > > >>>> together > > > > >>>>>> in a > > > > >>>>>>> WAL buffer on broker 2, it will have to send two commit > > requests: > > > > to > > > > >>>>>> broker > > > > >>>>>>> 0 to commit t1-0 and t2-1, and to broker 3 to commit t3-0. > This > > > > >>>> situation > > > > >>>>>>> is not permanent and as producers update the cluster > metadata, > > it > > > > >>>> will be > > > > >>>>>>> resolved. 
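(The "diskless produce representative" scheme described above needs nothing more than a rule every broker can evaluate from cluster metadata alone. A minimal sketch of one such deterministic, hash-style rule follows; all types and names are hypothetical.)

import java.util.Comparator;
import java.util.List;

/**
 * Hypothetical per-AZ "produce representative" rule: for a given leader broker and a
 * producer's rack, every broker derives the same representative from metadata alone.
 */
class ProduceRepresentativeSketch {

    record Broker(int id, String rack) { }

    /** Pick the broker in producerRack that represents the given partition leader. */
    static Broker representativeFor(Broker leader, String producerRack, List<Broker> brokers) {
        if (producerRack.equals(leader.rack())) {
            return leader; // in the leader's own AZ, the representative is the leader itself
        }
        List<Broker> candidates = brokers.stream()
                .filter(b -> b.rack().equals(producerRack))
                .sorted(Comparator.comparingInt(Broker::id))
                .toList();
        if (candidates.isEmpty()) {
            return leader; // no broker in that rack: fall back to the leader
        }
        // Deterministic choice: every broker and every crafted metadata response agree.
        return candidates.get(Math.floorMod(leader.id(), candidates.size()));
    }
}

Because the rule is a pure function of the leader id, the producer's rack, and the broker list, crafted metadata stays consistent across brokers, and each broker keeps fanning commits in to at most (number_of_azs - 1) remote leaders plus itself, as described above.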
> > > > >>>>>>> > > > > >>>>>>> This all could be built with the metadata crafting mechanism > > only > > > > >>>> (which > > > > >>>>>>> is anyway needed for Diskless in one way or another to direct > > > > >>>> producers > > > > >>>>>> and > > > > >>>>>>> consumers where we need to avoid inter-AZ traffic), just with > > the > > > > >>>> right > > > > >>>>>>> policy for it (for example, some deterministic hash-based > > > formula). > > > > >>>> I.e. > > > > >>>>>> no > > > > >>>>>>> explicit support for “produce representative” or anything > like > > > this > > > > >>>> is > > > > >>>>>>> needed on the cluster level, in KRaft, etc. > > > > >>>>>>> > > > > >>>>>>>> The same WAL file metadata is now duplicated into two > places, > > > > >>>> partition > > > > >>>>>>>> leader and WAL File Manager. Which one is the source of > truth, > > > and > > > > >>>> how > > > > >>>>>> do > > > > >>>>>>>> we maintain consistency between the two places? > > > > >>>>>>> > > > > >>>>>>> We do only two operations on WAL files that span multiple > > > diskless > > > > >>>>>>> partitions: committing and deleting. Commits can be done > > > > >>>> independently as > > > > >>>>>>> described above. But deletes are different, because when a > file > > > is > > > > >>>>>> deleted, > > > > >>>>>>> this affects all the partitions that still have alive batches > > in > > > > this > > > > >>>>>> file > > > > >>>>>>> (if any). > > > > >>>>>>> > > > > >>>>>>> The WAL file manager is a necessary point of coordination to > > > delete > > > > >>>> WAL > > > > >>>>>>> files safely. We can say it is the source of truth about > files > > > > >>>>>> themselves, > > > > >>>>>>> while the partition leaders and their logs hold the truth > about > > > > >>>> whether a > > > > >>>>>>> particular file contains live batches of this particular > > > partition. > > > > >>>>>>> > > > > >>>>>>> The file manager will do this important task: be able to say > > for > > > > sure > > > > >>>>>> that > > > > >>>>>>> a file does not contain any live batch of any existing > > partition. > > > > For > > > > >>>>>> this, > > > > >>>>>>> it will have to periodically check against the partition > > leaders. > > > > >>>>>>> Considering that batch deletion is irreversible, when we > > declare > > > a > > > > >>>> file > > > > >>>>>>> “empty”, this is guaranteed to be and stay so. > > > > >>>>>>> > > > > >>>>>>> The file manager has to know about files being committed to > > start > > > > >>>> track > > > > >>>>>>> them and periodically check if they are empty. We can > consider > > > > >>>> various > > > > >>>>>> ways > > > > >>>>>>> to achieve this: > > > > >>>>>>> 1. As was proposed in my previous message: best effort commit > > by > > > > >>>> brokers > > > > >>>>>> + > > > > >>>>>>> periodic prefix scans of object storage to detect files that > > went > > > > >>>> below > > > > >>>>>> the > > > > >>>>>>> radar due to network issue or the file manager temporary > > > > >>>> unavailability. > > > > >>>>>>> We’re speaking about listing the file names only and opening > > only > > > > >>>>>>> previously unknown files in order to find the partitions > > involved > > > > >>>> with > > > > >>>>>> them. > > > > >>>>>>> 2. Only do scans without explicit commit, i.e. fill the list > of > > > > files > > > > >>>>>>> fully asynchronously and in the background. This may be not > > ideal > > > > >>>> due to > > > > >>>>>>> costs and performance of scanning tons of files. 
However, the > > > > number > > > > >>>> of > > > > >>>>>>> live WAL files should be limited due to tiered storage > > > offloading + > > > > >>>> we > > > > >>>>>> can > > > > >>>>>>> optimize this if we give files some global soft order in > their > > > > names. > > > > >>>>>>> > > > > >>>>>>>> I am not sure how this design simplifies the implementation. > > The > > > > >>>>>> existing > > > > >>>>>>>> producer/replication code can't be simply reused. Adjusting > > both > > > > >>>> the > > > > >>>>>>> write > > > > >>>>>>>> path in the leader and the replication path in the follower > to > > > > >>>>>> understand > > > > >>>>>>>> batch-header only data is quite intrusive to the existing > > logic. > > > > >>>>>>> > > > > >>>>>>> It is true that we’ll have to change LocalLog and UnifiedLog > in > > > > >>>> order to > > > > >>>>>>> support these changes. However, it seems that idempotence, > > > > >>>> transactions, > > > > >>>>>>> queues, tiered storage will have to be changed less than with > > the > > > > >>>>>> original > > > > >>>>>>> design. This is because the partition leader state would > remain > > > in > > > > >>>> the > > > > >>>>>> same > > > > >>>>>>> place (on brokers) and existing workflows that involve it > would > > > > have > > > > >>>> to > > > > >>>>>> be > > > > >>>>>>> changed less compared to the situation where we globalize the > > > > >>>> partition > > > > >>>>>>> leader state in the batch coordinator. I admit this is hard > to > > > make > > > > >>>>>>> convincing without both real implementations to hand :) > > > > >>>>>>> > > > > >>>>>>>> I am also > > > > >>>>>>>> not sure how this enables seamless switching the topic modes > > > > >>>> between > > > > >>>>>>>> diskless and classic. Could you provide more details on > those? > > > > >>>>>>> > > > > >>>>>>> Let’s consider the scenario of turning a classic topic into > > > > >>>> diskless. The > > > > >>>>>>> user sets diskless.enabled=true, the leader receives this > > > metadata > > > > >>>> update > > > > >>>>>>> and does the following: > > > > >>>>>>> 1. Stop accepting normal append writes. > > > > >>>>>>> 2. Close the current active segment. > > > > >>>>>>> 3. Start a new segment that will be written in the diskless > > > format > > > > >>>> (i.e. > > > > >>>>>>> without data). > > > > >>>>>>> 4. Start accepting diskless commits. > > > > >>>>>>> > > > > >>>>>>> Since it’s the same log, the followers will know about that > > > switch > > > > >>>>>>> consistently. They will finish replicating the classic > segments > > > and > > > > >>>> start > > > > >>>>>>> replicating the diskless ones. They will always know where > each > > > > >>>> batch is > > > > >>>>>>> located (either inside a classic segment or referenced by a > > > > diskless > > > > >>>>>> one). > > > > >>>>>>> Switching back should be similar. > > > > >>>>>>> > > > > >>>>>>> Doing this with the coordinator is possible, but has some > > > caveats. > > > > >>>> The > > > > >>>>>>> leader must do the following: > > > > >>>>>>> 1. Stop accepting normal append writes. > > > > >>>>>>> 2. Close the current active segment. > > > > >>>>>>> 3. Write a special control segment to persist and replicate > the > > > > fact > > > > >>>> that > > > > >>>>>>> from offset N the partition is now in the diskless mode. > > > > >>>>>>> 4. Inform the coordinator about the first offset N of the > > > “diskless > > > > >>>> era”. > > > > >>>>>>> 5. 
Inform the controller quorum that the transition has > > finished > > > > and > > > > >>>> that > > > > >>>>>>> brokers now can process diskless writes for this partition. > > > > >>>>>>> This could fail at some points, so this will probably require > > > some > > > > >>>>>>> explicit state machine with replication either in the > partition > > > log > > > > >>>> or in > > > > >>>>>>> KRaft. > > > > >>>>>>> > > > > >>>>>>> It seems that the coordinator-less approach makes this > simpler > > > > >>>> because > > > > >>>>>> the > > > > >>>>>>> “coordinator” for the partition and the partition leader are > > the > > > > >>>> same and > > > > >>>>>>> they store the partition metadata in the same log, too. While > > in > > > > the > > > > >>>>>>> coordinator approach we have to perform some kind of a > > > distributed > > > > >>>> commit > > > > >>>>>>> to handover metadata management from the classic partition > > leader > > > > to > > > > >>>> the > > > > >>>>>>> batch coordinator. > > > > >>>>>>> > > > > >>>>>>> I hope these explanations help to clarify the idea. Please > let > > me > > > > >>>> know if > > > > >>>>>>> I should go deeper anywhere. > > > > >>>>>>> > > > > >>>>>>> Best, > > > > >>>>>>> Ivan and the Diskless team > > > > >>>>>>> > > > > >>>>>>> On Tue, Oct 7, 2025, at 01:44, Jun Rao wrote: > > > > >>>>>>>> Hi, Ivan, > > > > >>>>>>>> > > > > >>>>>>>> Thanks for the update. > > > > >>>>>>>> > > > > >>>>>>>> I am not sure that I fully understand the new design, but it > > > seems > > > > >>>> less > > > > >>>>>>>> clean than before. > > > > >>>>>>>> > > > > >>>>>>>> Does each partition now have a metadata partition and a > > separate > > > > >>>> data > > > > >>>>>>>> partition? If so, I am concerned that it essentially doubles > > the > > > > >>>> number > > > > >>>>>>> of > > > > >>>>>>>> partitions, which impacts the number of open file > descriptors > > > and > > > > >>>> the > > > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a > > > > separate > > > > >>>>>>>> partition just to store the metadata. It's as if we are > > creating > > > > an > > > > >>>>>>>> internal topic with an unbounded number of partitions. Are > the > > > > >>>> metadata > > > > >>>>>>> and > > > > >>>>>>>> the data for the same partition always collocated on the > same > > > > >>>> broker? > > > > >>>>>> If > > > > >>>>>>>> so, how do we enforce that when replicas are reassigned? > > > > >>>>>>>> > > > > >>>>>>>> The number of RPCs in the produce path is significantly > > higher. > > > > For > > > > >>>>>>>> example, if a produce request has 100 partitions, in a > cluster > > > > >>>> with 100 > > > > >>>>>>>> brokers, each produce request could generate 100 more RPC > > > > requests. > > > > >>>>>> This > > > > >>>>>>>> will significantly increase the request rate. > > > > >>>>>>>> > > > > >>>>>>>> The same WAL file metadata is now duplicated into two > places, > > > > >>>> partition > > > > >>>>>>>> leader and WAL File Manager. Which one is the source of > truth, > > > and > > > > >>>> how > > > > >>>>>> do > > > > >>>>>>>> we maintain consistency between the two places? > > > > >>>>>>>> > > > > >>>>>>>> I am not sure how this design simplifies the implementation. > > The > > > > >>>>>> existing > > > > >>>>>>>> producer/replication code can't be simply reused. 
Adjusting > > both > > > > >>>> the > > > > >>>>>>> write > > > > >>>>>>>> path in the leader and the replication path in the follower > to > > > > >>>>>> understand > > > > >>>>>>>> batch-header only data is quite intrusive to the existing > > > logic. I > > > > >>>> am > > > > >>>>>>> also > > > > >>>>>>>> not sure how this enables seamless switching the topic modes > > > > >>>> between > > > > >>>>>>>> diskless and classic. Could you provide more details on > those? > > > > >>>>>>>> > > > > >>>>>>>> Jun > > > > >>>>>>>> > > > > >>>>>>>> On Thu, Oct 2, 2025 at 5:08 AM Ivan Yurchenko < > [email protected] > > > > > > > >>>> wrote: > > > > >>>>>>>> > > > > >>>>>>>>> Hi dear Kafka community, > > > > >>>>>>>>> > > > > >>>>>>>>> In the initial Diskless proposal, we proposed to have a > > > separate > > > > >>>>>>>>> component, batch/diskless coordinator, whose role would be > to > > > > >>>>>> centrally > > > > >>>>>>>>> manage the batch and WAL file metadata for diskless topics. > > > This > > > > >>>>>>> component > > > > >>>>>>>>> drew many reasonable comments from the community about how > it > > > > >>>> would > > > > >>>>>>> support > > > > >>>>>>>>> various Kafka features (transactions, queues) and its > > > > >>>> scalability. > > > > >>>>>>> While we > > > > >>>>>>>>> believe we have good answers to all the expressed concerns, > > we > > > > >>>> took a > > > > >>>>>>> step > > > > >>>>>>>>> back and looked at the problem from a different > perspective. > > > > >>>>>>>>> > > > > >>>>>>>>> We would like to propose an alternative Diskless design > > > *without > > > > >>>> a > > > > >>>>>>>>> centralized coordinator*. We believe this approach has > > > potential > > > > >>>> and > > > > >>>>>>>>> propose to discuss it as it may be more appealing to the > > > > >>>> community. > > > > >>>>>>>>> > > > > >>>>>>>>> Let us explain the idea. Most of the complications with the > > > > >>>> original > > > > >>>>>>>>> Diskless approach come from one necessary architecture > > change: > > > > >>>>>>> globalizing > > > > >>>>>>>>> the local state of partition leader in the batch > coordinator. > > > > >>>> This > > > > >>>>>>> causes > > > > >>>>>>>>> deviations to the established workflows in various features > > > like > > > > >>>>>>> produce > > > > >>>>>>>>> idempotence and transactions, queues, retention, etc. These > > > > >>>>>> deviations > > > > >>>>>>> need > > > > >>>>>>>>> to be carefully considered, designed, and later implemented > > and > > > > >>>>>>> tested. In > > > > >>>>>>>>> the new approach we want to avoid this by making partition > > > > >>>> leaders > > > > >>>>>>> again > > > > >>>>>>>>> responsible for managing their partitions, even in diskless > > > > >>>> topics. > > > > >>>>>>>>> > > > > >>>>>>>>> In classic Kafka topics, batch data and metadata are > blended > > > > >>>> together > > > > >>>>>>> in > > > > >>>>>>>>> the one partition log. The crux of the Diskless idea is to > > > > >>>> decouple > > > > >>>>>>> them > > > > >>>>>>>>> and move data to the remote storage, while keeping metadata > > > > >>>> somewhere > > > > >>>>>>> else. > > > > >>>>>>>>> Using the central batch coordinator for managing batch > > metadata > > > > >>>> is > > > > >>>>>> one > > > > >>>>>>> way, > > > > >>>>>>>>> but not the only. > > > > >>>>>>>>> > > > > >>>>>>>>> Let’s now think about managing metadata for each user > > partition > > > > >>>>>>>>> independently. 
Generally partitions are independent and > don’t > > > > >>>> share > > > > >>>>>>>>> anything apart from that their data are mixed in WAL files. > > If > > > we > > > > >>>>>>> figure > > > > >>>>>>>>> out how to commit and later delete WAL files safely, we > will > > > > >>>> achieve > > > > >>>>>>> the > > > > >>>>>>>>> necessary autonomy that allows us to get rid of the central > > > batch > > > > >>>>>>>>> coordinator. Instead, *each diskless user partition will be > > > > >>>> managed > > > > >>>>>> by > > > > >>>>>>> its > > > > >>>>>>>>> leader*, as in classic Kafka topics. Also like in classic > > > > >>>> topics, the > > > > >>>>>>>>> leader uses the partition log as the way to persist batch > > > > >>>> metadata, > > > > >>>>>>> i.e. > > > > >>>>>>>>> the regular batch header + the information about how to > find > > > this > > > > >>>>>>> batch on > > > > >>>>>>>>> remote storage. In contrast to classic topics, batch data > is > > in > > > > >>>>>> remote > > > > >>>>>>>>> storage. > > > > >>>>>>>>> > > > > >>>>>>>>> For clarity, let’s compare the three designs: > > > > >>>>>>>>> • Classic topics: > > > > >>>>>>>>> • Data and metadata are co-located in the partition log. > > > > >>>>>>>>> • The partition log content: [Batch header > (metadata)|Batch > > > > >>>> data]. > > > > >>>>>>>>> • The partition log is replicated to the followers. > > > > >>>>>>>>> • The replicas and leader have local state built from > > > > >>>> metadata. > > > > >>>>>>>>> • Original Diskless: > > > > >>>>>>>>> • Metadata is in the batch coordinator, data is on remote > > > > >>>> storage. > > > > >>>>>>>>> • The partition state is global in the batch coordinator. > > > > >>>>>>>>> • New Diskless: > > > > >>>>>>>>> • Metadata is in the partition log, data is on remote > > storage. > > > > >>>>>>>>> • Partition log content: [Batch header (metadata)|Batch > > > > >>>>>> coordinates > > > > >>>>>>> on > > > > >>>>>>>>> remote storage]. > > > > >>>>>>>>> • The partition log is replicated to the followers. > > > > >>>>>>>>> • The replicas and leader have local state built from > > > > >>>> metadata. > > > > >>>>>>>>> > > > > >>>>>>>>> Let’s consider the produce path. Here’s the reminder of the > > > > >>>> original > > > > >>>>>>>>> Diskless design: > > > > >>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>>> The new approach could be depicted as the following: > > > > >>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>>> As you can see, the main difference is that now instead of > a > > > > >>>> single > > > > >>>>>>> commit > > > > >>>>>>>>> request to the batch coordinator, we send multiple parallel > > > > >>>> commit > > > > >>>>>>> requests > > > > >>>>>>>>> to all the leaders of each partition involved in the WAL > > file. > > > > >>>> Each > > > > >>>>>> of > > > > >>>>>>> them > > > > >>>>>>>>> will commit its batches independently, without coordinating > > > with > > > > >>>>>> other > > > > >>>>>>>>> leaders and any other components. Batch data is addressed > by > > > the > > > > >>>> WAL > > > > >>>>>>> file > > > > >>>>>>>>> name, the byte offset and size, which allows partitions to > > know > > > > >>>>>> nothing > > > > >>>>>>>>> about other partitions to access their data in shared WAL > > > files. > > > > >>>>>>>>> > > > > >>>>>>>>> The number of partitions involved in a single WAL file may > be > > > > >>>> quite > > > > >>>>>>> large, > > > > >>>>>>>>> e.g. a hundred. A hundred network requests to commit one > WAL > > > > >>>> file is > > > > >>>>>>> very > > > > >>>>>>>>> impractical. 
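Before turning to how that commit fan-out can be reduced, here is, for illustration only, a minimal sketch of the kind of entry a diskless partition leader might append to its log under this design. The names (WalCoordinates, DisklessBatchMetadata) and the exact field set are hypothetical and not taken from the KIP text:

    // Illustrative only: hypothetical names and fields, not taken from KIP-1163.
    // One log entry for a committed diskless batch: the usual header fields plus
    // the coordinates of the batch payload inside a shared WAL file.
    record WalCoordinates(String walFileName, long byteOffset, int sizeInBytes) {}

    record DisklessBatchMetadata(
        long baseOffset,       // first offset assigned to this batch in the partition
        int recordCount,       // number of records in the batch
        long producerId,       // idempotence / transaction bookkeeping, as in classic headers
        short producerEpoch,
        int baseSequence,
        WalCoordinates data    // where the batch bytes actually live on remote storage
    ) {}

Each such entry stands in for one committed batch: the familiar header fields plus the coordinates needed to read the payload out of a shared WAL file, so a partition needs no knowledge of the other partitions mixed into that file. What remains is the per-WAL-file commit fan-out noted above.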
However, there are ways to reduce this number: > > > > >>>>>>>>> 1. Partition leaders are located on brokers. Requests to > > > > >>>> leaders on > > > > >>>>>>> one > > > > >>>>>>>>> broker could be grouped together into a single physical > > network > > > > >>>>>> request > > > > >>>>>>>>> (resembling the normal Produce request that may carry > batches > > > for > > > > >>>>>> many > > > > >>>>>>>>> partitions inside). This will cap the number of network > > > requests > > > > >>>> to > > > > >>>>>> the > > > > >>>>>>>>> number of brokers in the cluster. > > > > >>>>>>>>> 2. If we craft the cluster metadata to make producers send > > > their > > > > >>>>>>> requests > > > > >>>>>>>>> to the right brokers (with respect to AZs), we may achieve > > the > > > > >>>> higher > > > > >>>>>>>>> concentration of logical commit requests in physical > network > > > > >>>> requests > > > > >>>>>>>>> reducing the number of the latter ones even further, > ideally > > to > > > > >>>> one. > > > > >>>>>>>>> > > > > >>>>>>>>> Obviously, out of multiple commit requests some may fail or > > > time > > > > >>>> out > > > > >>>>>>> for a > > > > >>>>>>>>> variety of reasons. This is fine. Some producers will > receive > > > > >>>> totally > > > > >>>>>>> or > > > > >>>>>>>>> partially failed responses to their Produce requests, > similar > > > to > > > > >>>> what > > > > >>>>>>> they > > > > >>>>>>>>> would have received when appending to a classic topic fails > > or > > > > >>>> times > > > > >>>>>>> out. > > > > >>>>>>>>> If a partition experiences problems, other partitions will > > not > > > be > > > > >>>>>>> affected > > > > >>>>>>>>> (again, like in classic topics). Of course, the uncommitted > > > data > > > > >>>> will > > > > >>>>>>> be > > > > >>>>>>>>> garbage in WAL files. But WAL files are short-lived > (batches > > > are > > > > >>>>>>> constantly > > > > >>>>>>>>> assembled into segments and offloaded to tiered storage), > so > > > this > > > > >>>>>>> garbage > > > > >>>>>>>>> will be eventually deleted. > > > > >>>>>>>>> > > > > >>>>>>>>> For safely deleting WAL files we now need to centrally > manage > > > > >>>> them, > > > > >>>>>> as > > > > >>>>>>>>> this is the only state and logic that spans multiple > > > partitions. > > > > >>>> On > > > > >>>>>> the > > > > >>>>>>>>> diagram, you can see another commit request called “Commit > > file > > > > >>>> (best > > > > >>>>>>>>> effort)” going to the WAL File Manager. This manager will > be > > > > >>>>>>> responsible > > > > >>>>>>>>> for the following: > > > > >>>>>>>>> 1. Collecting (by requests from brokers) and persisting > > > > >>>> information > > > > >>>>>>> about > > > > >>>>>>>>> committed WAL files. > > > > >>>>>>>>> 2. To handle potential failures in file information > delivery, > > > it > > > > >>>>>> will > > > > >>>>>>> be > > > > >>>>>>>>> doing prefix scan on the remote storage periodically to > find > > > and > > > > >>>>>>> register > > > > >>>>>>>>> unknown files. The period of this scan will be configurable > > and > > > > >>>>>> ideally > > > > >>>>>>>>> should be quite long. > > > > >>>>>>>>> 3. Checking with the relevant partition leaders (after a > > grace > > > > >>>>>>> period) if > > > > >>>>>>>>> they still have batches in a particular file. > > > > >>>>>>>>> 4. Physically deleting files when they aren’t anymore > > referred > > > > >>>> to by > > > > >>>>>>> any > > > > >>>>>>>>> partition. > > > > >>>>>>>>> > > > > >>>>>>>>> This new design offers the following advantages: > > > > >>>>>>>>> 1. 
It simplifies the implementation of many Kafka features > > such > > > > >>>> as > > > > >>>>>>>>> idempotence, transactions, queues, tiered storage, > retention. > > > > >>>> Now we > > > > >>>>>>> don’t > > > > >>>>>>>>> need to abstract away and reuse the code from partition > > leaders > > > > >>>> in > > > > >>>>>> the > > > > >>>>>>>>> batch coordinator. Instead, we will literally use the same > > code > > > > >>>> paths > > > > >>>>>>> in > > > > >>>>>>>>> leaders, with little adaptation. Workflows from classic > > topics > > > > >>>> mostly > > > > >>>>>>>>> remain unchanged. > > > > >>>>>>>>> For example, it seems that > > > > >>>>>>>>> ReplicaManager.maybeSendPartitionsToTransactionCoordinator > > and > > > > >>>>>>>>> KafkaApis.handleWriteTxnMarkersRequest used for transaction > > > > >>>> support > > > > >>>>>> on > > > > >>>>>>> the > > > > >>>>>>>>> partition leader side could be used for diskless topics > with > > > > >>>> little > > > > >>>>>>>>> adaptation. ProducerStateManager, needed for both > idempotent > > > > >>>> produce > > > > >>>>>>> and > > > > >>>>>>>>> transactions, would be reused. > > > > >>>>>>>>> Another example is share groups support, where the share > > > > >>>> partition > > > > >>>>>>> leader, > > > > >>>>>>>>> being co-located with the partition leader, would execute > the > > > > >>>> same > > > > >>>>>>> logic > > > > >>>>>>>>> for both diskless and classic topics. > > > > >>>>>>>>> 2. It returns to the familiar partition-based scaling > model, > > > > >>>> where > > > > >>>>>>>>> partitions are independent. > > > > >>>>>>>>> 3. It makes the operation and failure patterns closer to > the > > > > >>>>>> familiar > > > > >>>>>>>>> ones from classic topics. > > > > >>>>>>>>> 4. It opens a straightforward path to seamless switching > the > > > > >>>> topics > > > > >>>>>>> modes > > > > >>>>>>>>> between diskless and classic. > > > > >>>>>>>>> > > > > >>>>>>>>> The rest of the things remain unchanged compared to the > > > previous > > > > >>>>>>> Diskless > > > > >>>>>>>>> design (after all previous discussions). Such things as > local > > > > >>>> segment > > > > >>>>>>>>> materialization by replicas, the consume path, tiered > storage > > > > >>>>>>> integration, > > > > >>>>>>>>> etc. > > > > >>>>>>>>> > > > > >>>>>>>>> If the community finds this design more suitable, we will > > > update > > > > >>>> the > > > > >>>>>>>>> KIP(s) accordingly and continue working on it. Please let > us > > > know > > > > >>>>>> what > > > > >>>>>>> you > > > > >>>>>>>>> think. > > > > >>>>>>>>> > > > > >>>>>>>>> Best regards, > > > > >>>>>>>>> Ivan and Diskless team > > > > >>>>>>>>> > > > > >>>>>>>>> On Mon, Sep 29, 2025, at 15:06, Ivan Yurchenko wrote: > > > > >>>>>>>>>> Hi Justine, > > > > >>>>>>>>>> > > > > >>>>>>>>>> Yes, you're right. We need to track the aborted > transactions > > > > >>>> for in > > > > >>>>>>> the > > > > >>>>>>>>> diskless coordinator for as long as the corresponding > offsets > > > are > > > > >>>>>>> there. > > > > >>>>>>>>> With the tiered storage unification Greg mentioned earlier, > > > this > > > > >>>> will > > > > >>>>>>> be > > > > >>>>>>>>> finite time even for infinite data retention. > > > > >>>>>>>>>> > > > > >>>>>>>>>> Best, > > > > >>>>>>>>>> Ivan > > > > >>>>>>>>>> > > > > >>>>>>>>>> On Wed, Sep 17, 2025, at 19:41, Justine Olshan wrote: > > > > >>>>>>>>>>> Hey Ivan, > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Thanks for the response. 
I think most of what you said > made > > > > >>>>>> sense, > > > > >>>>>>> but > > > > >>>>>>>>> I > > > > >>>>>>>>>>> did have some questions about this part: > > > > >>>>>>>>>>> > > > > >>>>>>>>>>>> As we understand this, the partition leader in classic > > > > >>>> topics > > > > >>>>>>> forgets > > > > >>>>>>>>>>> about a transaction once it’s replicated (HWM overpasses > > > > >>>> it). The > > > > >>>>>>>>>>> transaction coordinator acts like the main guardian, > > allowing > > > > >>>>>>> partition > > > > >>>>>>>>>>> leaders to do this safely. Please correct me if this is > > > > >>>> wrong. We > > > > >>>>>>> think > > > > >>>>>>>>>>> about relying on this with the batch coordinator and > delete > > > > >>>> the > > > > >>>>>>>>> information > > > > >>>>>>>>>>> about a transaction once it’s finished (as there’s no > > > > >>>> replication > > > > >>>>>>> and > > > > >>>>>>>>> HWM > > > > >>>>>>>>>>> advances immediately). > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> I didn't quite understand this. In classic topics, we > have > > > > >>>> maps > > > > >>>>>> for > > > > >>>>>>>>> ongoing > > > > >>>>>>>>>>> transactions which remove state when the transaction is > > > > >>>> completed > > > > >>>>>>> and > > > > >>>>>>>>> an > > > > >>>>>>>>>>> aborted transactions index which is retained for much > > longer. > > > > >>>>>> Once > > > > >>>>>>> the > > > > >>>>>>>>>>> transaction is completed, the coordinator is no longer > > > > >>>> involved > > > > >>>>>> in > > > > >>>>>>>>>>> maintaining this partition side state, and it is subject > to > > > > >>>>>>> compaction > > > > >>>>>>>>> etc. > > > > >>>>>>>>>>> Looking back at the outline provided above, I didn't see > > much > > > > >>>>>>> about the > > > > >>>>>>>>>>> fetch path, so maybe that could be expanded a bit > further. > > I > > > > >>>> saw > > > > >>>>>>> the > > > > >>>>>>>>>>> following in a response: > > > > >>>>>>>>>>>> When the broker constructs a fully valid local segment, > > > > >>>> all the > > > > >>>>>>>>> necessary > > > > >>>>>>>>>>> control batches will be inserted and indices, including > the > > > > >>>>>>> transaction > > > > >>>>>>>>>>> index will be built to serve FetchRequests exactly as > they > > > > >>>> are > > > > >>>>>>> today. > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Based on this, it seems like we need to retain the > > > > >>>> information > > > > >>>>>>> about > > > > >>>>>>>>>>> aborted txns for longer. > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Thanks, > > > > >>>>>>>>>>> Justine > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> On Mon, Sep 15, 2025 at 9:43 AM Ivan Yurchenko < > > > > >>>> [email protected]> > > > > >>>>>>> wrote: > > > > >>>>>>>>>>> > > > > >>>>>>>>>>>> Hi Justine and all, > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> Thank you for your questions! > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely identified > > > > >>>> with > > > > >>>>>>>>> producer ID > > > > >>>>>>>>>>>>> and epoch, the positive result of this check could be > > > > >>>> cached > > > > >>>>>>>>> locally > > > > >>>>>>>>>>>>> Are we saying that only new transaction version 2 > > > > >>>>>> transactions > > > > >>>>>>> can > > > > >>>>>>>>> be > > > > >>>>>>>>>>>> used > > > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions > > > > >>>> with > > > > >>>>>>>>> producer id + > > > > >>>>>>>>>>>>> epoch > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> You’re right that we (probably unintentionally) focused > > > > >>>> only on > > > > >>>>>>>>> version 2. 
> > > > >>>>>>>>>>>> We can either limit the support to version 2 or consider > > > > >>>> using > > > > >>>>>>> some > > > > >>>>>>>>>>>> surrogates to support version 1. > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final > transactional > > > > >>>>>>> checks > > > > >>>>>>>>> of the > > > > >>>>>>>>>>>>> batches. This procedure would output the same errors > > > > >>>> like the > > > > >>>>>>>>> partition > > > > >>>>>>>>>>>>> leader in classic topics would do. > > > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be > > > > >>>>>> checking > > > > >>>>>>> if > > > > >>>>>>>>> the > > > > >>>>>>>>>>>>> transaction was still ongoing for example?* * > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> Yes, the producer epoch, that the transaction is > ongoing, > > > > >>>> and > > > > >>>>>> of > > > > >>>>>>>>> course > > > > >>>>>>>>>>>> the normal idempotence checks. What the partition leader > > > > >>>> in the > > > > >>>>>>>>> classic > > > > >>>>>>>>>>>> topics does before appending a batch to the local log > > > > >>>> (e.g. in > > > > >>>>>>>>>>>> UnifiedLog.maybeStartTransactionVerification and > > > > >>>>>>>>>>>> UnifiedLog.analyzeAndValidateProducerState). In > Diskless, > > > > >>>> we > > > > >>>>>>>>> unfortunately > > > > >>>>>>>>>>>> cannot do these checks before appending the data to the > > WAL > > > > >>>>>>> segment > > > > >>>>>>>>> and > > > > >>>>>>>>>>>> uploading it, but we can “tombstone” these batches in > the > > > > >>>> batch > > > > >>>>>>>>> coordinator > > > > >>>>>>>>>>>> during the final commit. > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>>> Is there state about ongoing > > > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some other > > > > >>>> state > > > > >>>>>>>>> mentioned > > > > >>>>>>>>>>>> in > > > > >>>>>>>>>>>>> the End transaction section, but it's not super clear > > > > >>>> what > > > > >>>>>>> state is > > > > >>>>>>>>>>>> stored > > > > >>>>>>>>>>>>> and when it is stored. > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> Right, this should have been more explicit. As the > > > > >>>> partition > > > > >>>>>>> leader > > > > >>>>>>>>> tracks > > > > >>>>>>>>>>>> ongoing transactions for classic topics, the batch > > > > >>>> coordinator > > > > >>>>>>> has > > > > >>>>>>>>> to as > > > > >>>>>>>>>>>> well. So when a transaction starts and ends, the > > > > >>>> transaction > > > > >>>>>>>>> coordinator > > > > >>>>>>>>>>>> must inform the batch coordinator about this. > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO -- > > > > >>>> perhaps > > > > >>>>>>> that > > > > >>>>>>>>> would > > > > >>>>>>>>>>>> be > > > > >>>>>>>>>>>>> stored in the batch coordinator? > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> Yes. This could be deduced from the committed batches > and > > > > >>>> other > > > > >>>>>>>>>>>> information, but for the sake of performance we’d better > > > > >>>> store > > > > >>>>>> it > > > > >>>>>>>>>>>> explicitly. > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long > transactional > > > > >>>>>>> state is > > > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will be > > > > >>>>>> cleaned > > > > >>>>>>> up? > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> As we understand this, the partition leader in classic > > > > >>>> topics > > > > >>>>>>> forgets > > > > >>>>>>>>>>>> about a transaction once it’s replicated (HWM overpasses > > > > >>>> it). 
> > > > >>>>>> The > > > > >>>>>>>>>>>> transaction coordinator acts like the main guardian, > > > > >>>> allowing > > > > >>>>>>>>> partition > > > > >>>>>>>>>>>> leaders to do this safely. Please correct me if this is > > > > >>>> wrong. > > > > >>>>>> We > > > > >>>>>>>>> think > > > > >>>>>>>>>>>> about relying on this with the batch coordinator and > > > > >>>> delete the > > > > >>>>>>>>> information > > > > >>>>>>>>>>>> about a transaction once it’s finished (as there’s no > > > > >>>>>> replication > > > > >>>>>>>>> and HWM > > > > >>>>>>>>>>>> advances immediately). > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> Best, > > > > >>>>>>>>>>>> Ivan > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> On Tue, Sep 9, 2025, at 00:38, Justine Olshan wrote: > > > > >>>>>>>>>>>>> Hey folks, > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> Excited to see some updates related to transactions! > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> I had a few questions. > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely identified > > > > >>>> with > > > > >>>>>>>>> producer ID > > > > >>>>>>>>>>>>> and epoch, the positive result of this check could be > > > > >>>> cached > > > > >>>>>>>>> locally > > > > >>>>>>>>>>>>> Are we saying that only new transaction version 2 > > > > >>>>>> transactions > > > > >>>>>>> can > > > > >>>>>>>>> be > > > > >>>>>>>>>>>> used > > > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions > > > > >>>> with > > > > >>>>>>>>> producer id + > > > > >>>>>>>>>>>>> epoch > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final > transactional > > > > >>>>>>> checks > > > > >>>>>>>>> of the > > > > >>>>>>>>>>>>> batches. This procedure would output the same errors > > > > >>>> like the > > > > >>>>>>>>> partition > > > > >>>>>>>>>>>>> leader in classic topics would do. > > > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be > > > > >>>>>> checking > > > > >>>>>>> if > > > > >>>>>>>>> the > > > > >>>>>>>>>>>>> transaction was still ongoing for example? Is there > state > > > > >>>>>> about > > > > >>>>>>>>> ongoing > > > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some other > > > > >>>> state > > > > >>>>>>>>> mentioned > > > > >>>>>>>>>>>> in > > > > >>>>>>>>>>>>> the End transaction section, but it's not super clear > > > > >>>> what > > > > >>>>>>> state is > > > > >>>>>>>>>>>> stored > > > > >>>>>>>>>>>>> and when it is stored. > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO -- > > > > >>>> perhaps > > > > >>>>>>> that > > > > >>>>>>>>> would > > > > >>>>>>>>>>>> be > > > > >>>>>>>>>>>>> stored in the batch coordinator? > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long > transactional > > > > >>>>>>> state is > > > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will be > > > > >>>>>> cleaned > > > > >>>>>>> up? > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> On Mon, Sep 8, 2025 at 10:38 AM Jun Rao > > > > >>>>>>> <[email protected]> > > > > >>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> Hi, Greg and Ivan, > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> Thanks for the update. A few comments. > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> JR 10. 
"Consumer fetches are now served from local > > > > >>>>>> segments, > > > > >>>>>>>>> making > > > > >>>>>>>>>>>> use of > > > > >>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>> indexes, page cache, request purgatory, and zero-copy > > > > >>>>>>>>> functionality > > > > >>>>>>>>>>>> already > > > > >>>>>>>>>>>>>> built into classic topics." > > > > >>>>>>>>>>>>>> JR 10.1 Does the broker build the producer state for > > > > >>>> each > > > > >>>>>>>>> partition in > > > > >>>>>>>>>>>>>> diskless topics? > > > > >>>>>>>>>>>>>> JR 10.2 For transactional data, the consumer fetches > > > > >>>> need > > > > >>>>>> to > > > > >>>>>>> know > > > > >>>>>>>>>>>> aborted > > > > >>>>>>>>>>>>>> records. How is that achieved? > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> JR 11. "The batch coordinator saves that the > > > > >>>> transaction is > > > > >>>>>>>>> finished > > > > >>>>>>>>>>>> and > > > > >>>>>>>>>>>>>> also inserts the control batches in the corresponding > > > > >>>> logs > > > > >>>>>>> of the > > > > >>>>>>>>>>>> involved > > > > >>>>>>>>>>>>>> Diskless topics. This happens only on the metadata > > > > >>>> level, > > > > >>>>>> no > > > > >>>>>>>>> actual > > > > >>>>>>>>>>>> control > > > > >>>>>>>>>>>>>> batches are written to any file. " > > > > >>>>>>>>>>>>>> A fetch response could include multiple transactional > > > > >>>>>>> batches. > > > > >>>>>>>>> How > > > > >>>>>>>>>>>> does the > > > > >>>>>>>>>>>>>> broker obtain the information about the ending control > > > > >>>>>> batch > > > > >>>>>>> for > > > > >>>>>>>>> each > > > > >>>>>>>>>>>>>> batch? Does that mean that a fetch response needs to > be > > > > >>>>>>> built by > > > > >>>>>>>>>>>>>> stitching record batches and generated control batches > > > > >>>>>>> together? > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> JR 12. Queues: Is there still a share partition leader > > > > >>>> that > > > > >>>>>>> all > > > > >>>>>>>>>>>> consumers > > > > >>>>>>>>>>>>>> are routed to? > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> JR 13. "Should the KIPs be modified to include this or > > > > >>>> it's > > > > >>>>>>> too > > > > >>>>>>>>>>>>>> implementation-focused?" It would be useful to include > > > > >>>>>> enough > > > > >>>>>>>>> details > > > > >>>>>>>>>>>> to > > > > >>>>>>>>>>>>>> understand correctness and performance impact. > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> HC5. Henry has a valid point. Requests from a given > > > > >>>>>> producer > > > > >>>>>>>>> contain a > > > > >>>>>>>>>>>>>> sequence number, which is ordered. If a producer sends > > > > >>>>>> every > > > > >>>>>>>>> Produce > > > > >>>>>>>>>>>>>> request to an arbitrary broker, those requests could > > > > >>>> reach > > > > >>>>>>> the > > > > >>>>>>>>> batch > > > > >>>>>>>>>>>>>> coordinator in different order and lead to rejection > > > > >>>> of the > > > > >>>>>>>>> produce > > > > >>>>>>>>>>>>>> requests. > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> Jun > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> On Thu, Sep 4, 2025 at 12:00 AM Ivan Yurchenko < > > > > >>>>>>> [email protected]> > > > > >>>>>>>>> wrote: > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> Hi all, > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> We have also thought in a bit more details about > > > > >>>>>>> transactions > > > > >>>>>>>>> and > > > > >>>>>>>>>>>> queues, > > > > >>>>>>>>>>>>>>> here's the plan. 
> > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> *Transactions* > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> The support for transactions in *classic topics* is > > > > >>>> based > > > > >>>>>>> on > > > > >>>>>>>>> precise > > > > >>>>>>>>>>>>>>> interactions between three actors: clients (mostly > > > > >>>>>>> producers, > > > > >>>>>>>>> but > > > > >>>>>>>>>>>> also > > > > >>>>>>>>>>>>>>> consumers), brokers (ReplicaManager and other > > > > >>>> classes), > > > > >>>>>> and > > > > >>>>>>>>>>>> transaction > > > > >>>>>>>>>>>>>>> coordinators. Brokers also run partition leaders with > > > > >>>>>> their > > > > >>>>>>>>> local > > > > >>>>>>>>>>>> state > > > > >>>>>>>>>>>>>>> (ProducerStateManager and others). > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> The high level (some details skipped) workflow is the > > > > >>>>>>>>> following. > > > > >>>>>>>>>>>> When a > > > > >>>>>>>>>>>>>>> transactional Produce request is received by the > > > > >>>> broker: > > > > >>>>>>>>>>>>>>> 1. For each partition, the partition leader checks > > > > >>>> if a > > > > >>>>>>>>> non-empty > > > > >>>>>>>>>>>>>>> transaction is running for this partition. This is > > > > >>>> done > > > > >>>>>>> using > > > > >>>>>>>>> its > > > > >>>>>>>>>>>> local > > > > >>>>>>>>>>>>>>> state derived from the log metadata > > > > >>>>>> (ProducerStateManager, > > > > >>>>>>>>>>>>>>> VerificationStateEntry, VerificationGuard). > > > > >>>>>>>>>>>>>>> 2. The transaction coordinator is informed about all > > > > >>>> the > > > > >>>>>>>>> partitions > > > > >>>>>>>>>>>> that > > > > >>>>>>>>>>>>>>> aren’t part of the transaction to include them. > > > > >>>>>>>>>>>>>>> 3. The partition leaders do additional transactional > > > > >>>>>>> checks. > > > > >>>>>>>>>>>>>>> 4. The partition leaders append the transactional > > > > >>>> data to > > > > >>>>>>>>> their logs > > > > >>>>>>>>>>>> and > > > > >>>>>>>>>>>>>>> update some of their state (for example, log the fact > > > > >>>>>> that > > > > >>>>>>> the > > > > >>>>>>>>>>>>>> transaction > > > > >>>>>>>>>>>>>>> is running for the partition and its first offset). > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> When the transaction is committed or aborted: > > > > >>>>>>>>>>>>>>> 1. The producer contacts the transaction coordinator > > > > >>>>>>> directly > > > > >>>>>>>>> with > > > > >>>>>>>>>>>>>>> EndTxnRequest. > > > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes PREPARE_COMMIT > > > > >>>> or > > > > >>>>>>>>>>>> PREPARE_ABORT to > > > > >>>>>>>>>>>>>>> its log and responds to the producer. > > > > >>>>>>>>>>>>>>> 3. The transaction coordinator sends > > > > >>>>>>> WriteTxnMarkersRequest to > > > > >>>>>>>>> the > > > > >>>>>>>>>>>>>> leaders > > > > >>>>>>>>>>>>>>> of the involved partitions. > > > > >>>>>>>>>>>>>>> 4. The partition leaders write the transaction > > > > >>>> markers to > > > > >>>>>>>>> their logs > > > > >>>>>>>>>>>> and > > > > >>>>>>>>>>>>>>> respond to the coordinator. > > > > >>>>>>>>>>>>>>> 5. The coordinator writes the final transaction state > > > > >>>>>>>>>>>> COMPLETE_COMMIT or > > > > >>>>>>>>>>>>>>> COMPLETE_ABORT. > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> In classic topics, partitions have leaders and lots > > > > >>>> of > > > > >>>>>>>>> important > > > > >>>>>>>>>>>> state > > > > >>>>>>>>>>>>>>> necessary for supporting this workflow is local. 
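For illustration, a rough sketch of the kind of per-partition, leader-local bookkeeping the classic flow above relies on; LocalTxnState and its methods are hypothetical and deliberately far simpler than the real ProducerStateManager:

    // Rough illustration only; LocalTxnState and its methods are hypothetical and
    // much simpler than the real ProducerStateManager.
    import java.util.HashMap;
    import java.util.Map;

    class LocalTxnState {
        // producerId -> first offset of the transaction currently open in this partition
        private final Map<Long, Long> ongoingTxnFirstOffset = new HashMap<>();

        // Step 1 of the produce flow: is a non-empty transaction already running here?
        boolean hasOngoingTransaction(long producerId) {
            return ongoingTxnFirstOffset.containsKey(producerId);
        }

        // Step 4: remember the transaction's first offset once data is appended.
        void markTransactionStarted(long producerId, long firstOffset) {
            ongoingTxnFirstOffset.putIfAbsent(producerId, firstOffset);
        }

        // Commit/abort: the leader writes the marker and drops the entry.
        void markTransactionEnded(long producerId) {
            ongoingTxnFirstOffset.remove(producerId);
        }
    }

Because this state lives on the partition leader, the checks in the produce flow above are purely local lookups.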
The > > > > >>>> main > > > > >>>>>>>>> challenge > > > > >>>>>>>>>>>> in > > > > >>>>>>>>>>>>>>> mapping this to Diskless comes from the fact there > > > > >>>> are no > > > > >>>>>>>>> partition > > > > >>>>>>>>>>>>>>> leaders, so the corresponding pieces of state need > > > > >>>> to be > > > > >>>>>>>>> globalized > > > > >>>>>>>>>>>> in > > > > >>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>> batch coordinator. We are already doing this to > > > > >>>> support > > > > >>>>>>>>> idempotent > > > > >>>>>>>>>>>>>> produce. > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> The high level workflow for *diskless topics* would > > > > >>>> look > > > > >>>>>>> very > > > > >>>>>>>>>>>> similar: > > > > >>>>>>>>>>>>>>> 1. For each partition, the broker checks if a > > > > >>>> non-empty > > > > >>>>>>>>> transaction > > > > >>>>>>>>>>>> is > > > > >>>>>>>>>>>>>>> running for this partition. In contrast to classic > > > > >>>>>> topics, > > > > >>>>>>>>> this is > > > > >>>>>>>>>>>>>> checked > > > > >>>>>>>>>>>>>>> against the batch coordinator with a single RPC. > > > > >>>> Since a > > > > >>>>>>>>> transaction > > > > >>>>>>>>>>>>>> could > > > > >>>>>>>>>>>>>>> be uniquely identified with producer ID and epoch, > > > > >>>> the > > > > >>>>>>> positive > > > > >>>>>>>>>>>> result of > > > > >>>>>>>>>>>>>>> this check could be cached locally (for the double > > > > >>>>>>> configured > > > > >>>>>>>>>>>> duration > > > > >>>>>>>>>>>>>> of a > > > > >>>>>>>>>>>>>>> transaction, for example). > > > > >>>>>>>>>>>>>>> 2. The same: The transaction coordinator is informed > > > > >>>>>> about > > > > >>>>>>> all > > > > >>>>>>>>> the > > > > >>>>>>>>>>>>>>> partitions that aren’t part of the transaction to > > > > >>>> include > > > > >>>>>>> them. > > > > >>>>>>>>>>>>>>> 3. No transactional checks are done on the broker > > > > >>>> side. > > > > >>>>>>>>>>>>>>> 4. The broker appends the transactional data to the > > > > >>>>>> current > > > > >>>>>>>>> shared > > > > >>>>>>>>>>>> WAL > > > > >>>>>>>>>>>>>>> segment. It doesn’t update any transaction-related > > > > >>>> state > > > > >>>>>>> for > > > > >>>>>>>>> Diskless > > > > >>>>>>>>>>>>>>> topics, because it doesn’t have any. > > > > >>>>>>>>>>>>>>> 5. The WAL segment is committed to the batch > > > > >>>> coordinator > > > > >>>>>>> like > > > > >>>>>>>>> in the > > > > >>>>>>>>>>>>>>> normal produce flow. > > > > >>>>>>>>>>>>>>> 6. The batch coordinator does the final transactional > > > > >>>>>>> checks > > > > >>>>>>>>> of the > > > > >>>>>>>>>>>>>>> batches. This procedure would output the same errors > > > > >>>> like > > > > >>>>>>> the > > > > >>>>>>>>>>>> partition > > > > >>>>>>>>>>>>>>> leader in classic topics would do. I.e. some batches > > > > >>>>>> could > > > > >>>>>>> be > > > > >>>>>>>>>>>> rejected. > > > > >>>>>>>>>>>>>>> This means, there will potentially be garbage in the > > > > >>>> WAL > > > > >>>>>>>>> segment > > > > >>>>>>>>>>>> file in > > > > >>>>>>>>>>>>>>> case of transactional errors. This is preferable to > > > > >>>> doing > > > > >>>>>>> more > > > > >>>>>>>>>>>> network > > > > >>>>>>>>>>>>>>> round trips, especially considering the WAL segments > > > > >>>> will > > > > >>>>>>> be > > > > >>>>>>>>>>>> relatively > > > > >>>>>>>>>>>>>>> short-living (see the Greg's update above). > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> When the transaction is committed or aborted: > > > > >>>>>>>>>>>>>>> 1. 
The producer contacts the transaction coordinator > > > > >>>>>>> directly > > > > >>>>>>>>> with > > > > >>>>>>>>>>>>>>> EndTxnRequest. > > > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes PREPARE_COMMIT > > > > >>>> or > > > > >>>>>>>>>>>> PREPARE_ABORT to > > > > >>>>>>>>>>>>>>> its log and responds to the producer. > > > > >>>>>>>>>>>>>>> 3. *[NEW]* The transaction coordinator informs the > > > > >>>> batch > > > > >>>>>>>>> coordinator > > > > >>>>>>>>>>>> that > > > > >>>>>>>>>>>>>>> the transaction is finished. > > > > >>>>>>>>>>>>>>> 4. *[NEW]* The batch coordinator saves that the > > > > >>>>>>> transaction is > > > > >>>>>>>>>>>> finished > > > > >>>>>>>>>>>>>>> and also inserts the control batches in the > > > > >>>> corresponding > > > > >>>>>>> logs > > > > >>>>>>>>> of the > > > > >>>>>>>>>>>>>>> involved Diskless topics. This happens only on the > > > > >>>>>> metadata > > > > >>>>>>>>> level, no > > > > >>>>>>>>>>>>>>> actual control batches are written to any file. They > > > > >>>> will > > > > >>>>>>> be > > > > >>>>>>>>>>>> dynamically > > > > >>>>>>>>>>>>>>> created on Fetch and other read operations. We could > > > > >>>>>>>>> technically > > > > >>>>>>>>>>>> write > > > > >>>>>>>>>>>>>>> these control batches for real, but this would mean > > > > >>>> extra > > > > >>>>>>>>> produce > > > > >>>>>>>>>>>>>> latency, > > > > >>>>>>>>>>>>>>> so it's better just to mark them in the batch > > > > >>>> coordinator > > > > >>>>>>> and > > > > >>>>>>>>> save > > > > >>>>>>>>>>>> these > > > > >>>>>>>>>>>>>>> milliseconds. > > > > >>>>>>>>>>>>>>> 5. The transaction coordinator sends > > > > >>>>>>> WriteTxnMarkersRequest to > > > > >>>>>>>>> the > > > > >>>>>>>>>>>>>> leaders > > > > >>>>>>>>>>>>>>> of the involved partitions. – Now only to classic > > > > >>>> topics > > > > >>>>>>> now. > > > > >>>>>>>>>>>>>>> 6. The partition leaders of classic topics write the > > > > >>>>>>>>> transaction > > > > >>>>>>>>>>>> markers > > > > >>>>>>>>>>>>>>> to their logs and respond to the coordinator. > > > > >>>>>>>>>>>>>>> 7. The coordinator writes the final transaction state > > > > >>>>>>>>>>>> COMPLETE_COMMIT or > > > > >>>>>>>>>>>>>>> COMPLETE_ABORT. > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> Compared to the non-transactional produce flow, we > > > > >>>> get: > > > > >>>>>>>>>>>>>>> 1. An extra network round trip between brokers and > > > > >>>> the > > > > >>>>>>> batch > > > > >>>>>>>>>>>> coordinator > > > > >>>>>>>>>>>>>>> when a new partition appear in the transaction. To > > > > >>>>>>> mitigate the > > > > >>>>>>>>>>>> impact of > > > > >>>>>>>>>>>>>>> them: > > > > >>>>>>>>>>>>>>> - The results will be cached. > > > > >>>>>>>>>>>>>>> - The calls for multiple partitions in one Produce > > > > >>>>>>> request > > > > >>>>>>>>> will be > > > > >>>>>>>>>>>>>>> grouped. > > > > >>>>>>>>>>>>>>> - The batch coordinator should be optimized for > > > > >>>> fast > > > > >>>>>>>>> response to > > > > >>>>>>>>>>>> these > > > > >>>>>>>>>>>>>>> RPCs. > > > > >>>>>>>>>>>>>>> - The fact that a single producer normally will > > > > >>>>>>> communicate > > > > >>>>>>>>> with a > > > > >>>>>>>>>>>>>>> single broker for the duration of the transaction > > > > >>>> further > > > > >>>>>>>>> reduces the > > > > >>>>>>>>>>>>>>> expected number of round trips. > > > > >>>>>>>>>>>>>>> 2. An extra round trip between the transaction > > > > >>>>>> coordinator > > > > >>>>>>> and > > > > >>>>>>>>> batch > > > > >>>>>>>>>>>>>>> coordinator when a transaction is finished. 
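For illustration, a minimal sketch of the broker-side cache suggested in step 1 above, keyed by producer ID and epoch and retained for roughly twice the configured transaction duration; OngoingTxnCheckCache and its methods are hypothetical names, not part of any KIP interface:

    // Sketch only; OngoingTxnCheckCache and its methods are hypothetical and not part
    // of any KIP interface. It captures the "cache positive checks by producer id and
    // epoch" idea from step 1 of the diskless produce flow.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class OngoingTxnCheckCache {
        record Key(long producerId, short producerEpoch) {}

        private final Map<Key, Long> verifiedUntilMs = new ConcurrentHashMap<>();
        private final long ttlMs; // e.g. twice the configured transaction duration

        OngoingTxnCheckCache(long ttlMs) {
            this.ttlMs = ttlMs;
        }

        // True if this (producerId, epoch) was confirmed ongoing recently enough,
        // letting the broker skip the RPC to the batch coordinator.
        boolean isVerified(long producerId, short producerEpoch, long nowMs) {
            Long until = verifiedUntilMs.get(new Key(producerId, producerEpoch));
            return until != null && nowMs < until;
        }

        // Record a positive answer received from the batch coordinator.
        void markVerified(long producerId, short producerEpoch, long nowMs) {
            verifiedUntilMs.put(new Key(producerId, producerEpoch), nowMs + ttlMs);
        }
    }

Only positive answers are cached; a miss or an expired entry falls back to the grouped RPC to the batch coordinator described above.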
> > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> With this proposal, transactions will also be able to > > > > >>>>>> span > > > > >>>>>>> both > > > > >>>>>>>>>>>> classic > > > > >>>>>>>>>>>>>>> and Diskless topics. > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> *Queues* > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> The share group coordination and management is a > > > > >>>> side job > > > > >>>>>>> that > > > > >>>>>>>>>>>> doesn't > > > > >>>>>>>>>>>>>>> interfere with the topic itself (leadership, > > > > >>>> replicas, > > > > >>>>>>> physical > > > > >>>>>>>>>>>> storage > > > > >>>>>>>>>>>>>> of > > > > >>>>>>>>>>>>>>> records, etc.) and non-queue producers and consumers > > > > >>>>>>> (Fetch and > > > > >>>>>>>>>>>> Produce > > > > >>>>>>>>>>>>>>> RPCs, consumer group-related RPCs are not affected.) > > > > >>>> We > > > > >>>>>>> don't > > > > >>>>>>>>> see any > > > > >>>>>>>>>>>>>>> reason why we can't make Diskless topics compatible > > > > >>>> with > > > > >>>>>>> share > > > > >>>>>>>>>>>> groups the > > > > >>>>>>>>>>>>>>> same way as classic topics are. Even on the code > > > > >>>> level, > > > > >>>>>> we > > > > >>>>>>>>> don't > > > > >>>>>>>>>>>> expect > > > > >>>>>>>>>>>>>> any > > > > >>>>>>>>>>>>>>> serious refactoring: the same reading routines are > > > > >>>> used > > > > >>>>>>> that > > > > >>>>>>>>> are > > > > >>>>>>>>>>>> used for > > > > >>>>>>>>>>>>>>> fetching (e.g. ReplicaManager.readFromLog). > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> Should the KIPs be modified to include this or it's > > > > >>>> too > > > > >>>>>>>>>>>>>>> implementation-focused? > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> Best regards, > > > > >>>>>>>>>>>>>>> Ivan > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote: > > > > >>>>>>>>>>>>>>>> Hi all, > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> Thank you all for your questions and design input > > > > >>>> on > > > > >>>>>>>>> KIP-1150. > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> We have just updated KIP-1150 and KIP-1163 with a > > > > >>>> new > > > > >>>>>>>>> design. To > > > > >>>>>>>>>>>>>>> summarize > > > > >>>>>>>>>>>>>>>> the changes: > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> 1. The design prioritizes integrating with the > > > > >>>> existing > > > > >>>>>>>>> KIP-405 > > > > >>>>>>>>>>>> Tiered > > > > >>>>>>>>>>>>>>>> Storage interfaces, permitting data produced to a > > > > >>>>>>> Diskless > > > > >>>>>>>>> topic > > > > >>>>>>>>>>>> to be > > > > >>>>>>>>>>>>>>>> moved to tiered storage. > > > > >>>>>>>>>>>>>>>> This lowers the scalability requirements for the > > > > >>>> Batch > > > > >>>>>>>>> Coordinator > > > > >>>>>>>>>>>>>>>> component, and allows Diskless to compose with > > > > >>>> Tiered > > > > >>>>>>> Storage > > > > >>>>>>>>>>>> plugin > > > > >>>>>>>>>>>>>>>> features such as encryption and alternative data > > > > >>>>>> formats. > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> 2. Consumer fetches are now served from local > > > > >>>> segments, > > > > >>>>>>>>> making use > > > > >>>>>>>>>>>> of > > > > >>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>> indexes, page cache, request purgatory, and > > > > >>>> zero-copy > > > > >>>>>>>>> functionality > > > > >>>>>>>>>>>>>>> already > > > > >>>>>>>>>>>>>>>> built into classic topics. 
> > > > >>>>>>>>>>>>>>>> However, local segments are now considered cache > > > > >>>>>>> elements, > > > > >>>>>>>>> do not > > > > >>>>>>>>>>>> need > > > > >>>>>>>>>>>>>> to > > > > >>>>>>>>>>>>>>>> be durably stored, and can be built without > > > > >>>> contacting > > > > >>>>>>> any > > > > >>>>>>>>> other > > > > >>>>>>>>>>>>>>> replicas. > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> 3. The design has been simplified substantially, by > > > > >>>>>>> removing > > > > >>>>>>>>> the > > > > >>>>>>>>>>>>>> previous > > > > >>>>>>>>>>>>>>>> Diskless consume flow, distributed cache > > > > >>>> component, and > > > > >>>>>>>>> "object > > > > >>>>>>>>>>>>>>>> compaction/merging" step. > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> The design maintains leaderless produces as > > > > >>>> enabled by > > > > >>>>>>> the > > > > >>>>>>>>> Batch > > > > >>>>>>>>>>>>>>>> Coordinator, and the same latency profiles as the > > > > >>>>>> earlier > > > > >>>>>>>>> design, > > > > >>>>>>>>>>>> while > > > > >>>>>>>>>>>>>>>> being simpler and integrating better into the > > > > >>>> existing > > > > >>>>>>>>> ecosystem. > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> Thanks, and we are eager to hear your feedback on > > > > >>>> the > > > > >>>>>> new > > > > >>>>>>>>> design. > > > > >>>>>>>>>>>>>>>> Greg Harris > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:30 PM Jun Rao > > > > >>>>>>>>> <[email protected]> > > > > >>>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> Hi, Jan, > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> For me, the main gap of KIP-1150 is the support > > > > >>>> of > > > > >>>>>> all > > > > >>>>>>>>> existing > > > > >>>>>>>>>>>>>> client > > > > >>>>>>>>>>>>>>>>> APIs. Currently, there is no design for > > > > >>>> supporting > > > > >>>>>> APIs > > > > >>>>>>>>> like > > > > >>>>>>>>>>>>>>> transactions > > > > >>>>>>>>>>>>>>>>> and queues. > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> Thanks, > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> Jun > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski > > > > >>>>>>>>>>>>>>>>> <[email protected]> wrote: > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> Would it be a good time to ask for the current > > > > >>>>>>> status of > > > > >>>>>>>>> this > > > > >>>>>>>>>>>> KIP? > > > > >>>>>>>>>>>>>> I > > > > >>>>>>>>>>>>>>>>>> haven't seen much activity here for the past 2 > > > > >>>>>>> months, > > > > >>>>>>>>> the > > > > >>>>>>>>>>>> vote got > > > > >>>>>>>>>>>>>>>>> vetoed > > > > >>>>>>>>>>>>>>>>>> but I think the pending questions have been > > > > >>>>>> answered > > > > >>>>>>>>> since > > > > >>>>>>>>>>>> then. > > > > >>>>>>>>>>>>>>> KIP-1183 > > > > >>>>>>>>>>>>>>>>>> (AutoMQ's proposal) also didn't have any > > > > >>>> activity > > > > >>>>>>> since > > > > >>>>>>>>> May. > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> In my eyes KIP-1150 and KIP-1183 are two real > > > > >>>>>> choices > > > > >>>>>>>>> that can > > > > >>>>>>>>>>>> be > > > > >>>>>>>>>>>>>>>>>> made, with a coordinator-based approach being > > > > >>>> by > > > > >>>>>> far > > > > >>>>>>> the > > > > >>>>>>>>>>>> dominant > > > > >>>>>>>>>>>>>> one > > > > >>>>>>>>>>>>>>>>> when > > > > >>>>>>>>>>>>>>>>>> it comes to market adoption - but all these are > > > > >>>>>>>>> standalone > > > > >>>>>>>>>>>>>> products. 
> > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> I'm a big fan of both approaches, but would > > > > >>>> hate to > > > > >>>>>>> see a > > > > >>>>>>>>>>>> stall. So > > > > >>>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>>>> question is: can we get an update? > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> Maybe it's time to start another vote? Colin > > > > >>>>>> McCabe - > > > > >>>>>>>>> have your > > > > >>>>>>>>>>>>>>> questions > > > > >>>>>>>>>>>>>>>>>> been answered? If not, is there anything I can > > > > >>>> do > > > > >>>>>> to > > > > >>>>>>>>> help? I'm > > > > >>>>>>>>>>>>>> deeply > > > > >>>>>>>>>>>>>>>>>> familiar with both architectures and have > > > > >>>> written > > > > >>>>>>> about > > > > >>>>>>>>> both? > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> Kind regards, > > > > >>>>>>>>>>>>>>>>>> Jan > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> On Tue, Jun 24, 2025 at 10:42 AM Stanislav > > > > >>>>>> Kozlovski > > > > >>>>>>> < > > > > >>>>>>>>>>>>>>>>>> [email protected]> wrote: > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> I have some nits - it may be useful to > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> a) group all the KIP email threads in the > > > > >>>> main > > > > >>>>>> one > > > > >>>>>>>>> (just a > > > > >>>>>>>>>>>> bunch > > > > >>>>>>>>>>>>>> of > > > > >>>>>>>>>>>>>>>>> links > > > > >>>>>>>>>>>>>>>>>>> to everything) > > > > >>>>>>>>>>>>>>>>>>> b) create the email threads > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> It's a bit hard to track it all - for > > > > >>>> example, I > > > > >>>>>>> was > > > > >>>>>>>>>>>> searching > > > > >>>>>>>>>>>>>> for > > > > >>>>>>>>>>>>>>> a > > > > >>>>>>>>>>>>>>>>>>> discuss thread for KIP-1165 for a while; As > > > > >>>> far > > > > >>>>>> as > > > > >>>>>>> I > > > > >>>>>>>>> can > > > > >>>>>>>>>>>> tell, it > > > > >>>>>>>>>>>>>>>>> doesn't > > > > >>>>>>>>>>>>>>>>>>> exist yet. > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> Since the KIPs are published (by virtue of > > > > >>>> having > > > > >>>>>>> the > > > > >>>>>>>>> root > > > > >>>>>>>>>>>> KIP be > > > > >>>>>>>>>>>>>>>>>>> published, having a DISCUSS thread and links > > > > >>>> to > > > > >>>>>>>>> sub-KIPs > > > > >>>>>>>>>>>> where > > > > >>>>>>>>>>>>>> were > > > > >>>>>>>>>>>>>>>>> aimed > > > > >>>>>>>>>>>>>>>>>>> to move the discussion towards), I think it > > > > >>>> would > > > > >>>>>>> be > > > > >>>>>>>>> good to > > > > >>>>>>>>>>>>>> create > > > > >>>>>>>>>>>>>>>>>> DISCUSS > > > > >>>>>>>>>>>>>>>>>>> threads for them all. > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> Best, > > > > >>>>>>>>>>>>>>>>>>> Stan > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> On 2025/04/16 11:58:22 Josep Prat wrote: > > > > >>>>>>>>>>>>>>>>>>>> Hi Kafka Devs! > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> We want to start a new KIP discussion about > > > > >>>>>>>>> introducing a > > > > >>>>>>>>>>>> new > > > > >>>>>>>>>>>>>>> type of > > > > >>>>>>>>>>>>>>>>>>>> topics that would make use of Object > > > > >>>> Storage as > > > > >>>>>>> the > > > > >>>>>>>>> primary > > > > >>>>>>>>>>>>>>> source of > > > > >>>>>>>>>>>>>>>>>>>> storage. However, as this KIP is big we > > > > >>>> decided > > > > >>>>>>> to > > > > >>>>>>>>> split it > > > > >>>>>>>>>>>>>> into > > > > >>>>>>>>>>>>>>>>>> multiple > > > > >>>>>>>>>>>>>>>>>>>> related KIPs. 
> > > > >>>>>>>>>>>>>>>>>>>> We have the motivational KIP-1150 ( > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>> > > > > >>>>>> > > > > >>>> > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics > > > > >>>>>>>>>>>>>>>>>>> ) > > > > >>>>>>>>>>>>>>>>>>>> that aims to discuss if Apache Kafka > > > > >>>> should aim > > > > >>>>>>> to > > > > >>>>>>>>> have > > > > >>>>>>>>>>>> this > > > > >>>>>>>>>>>>>>> type of > > > > >>>>>>>>>>>>>>>>>>>> feature at all. This KIP doesn't go onto > > > > >>>>>> details > > > > >>>>>>> on > > > > >>>>>>>>> how to > > > > >>>>>>>>>>>>>>> implement > > > > >>>>>>>>>>>>>>>>>> it. > > > > >>>>>>>>>>>>>>>>>>>> This follows the same approach used when we > > > > >>>>>>> discussed > > > > >>>>>>>>>>>> KRaft. > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> But as we know that it is sometimes really > > > > >>>> hard > > > > >>>>>>> to > > > > >>>>>>>>> discuss > > > > >>>>>>>>>>>> on > > > > >>>>>>>>>>>>>>> that > > > > >>>>>>>>>>>>>>>>> meta > > > > >>>>>>>>>>>>>>>>>>>> level, we also created several sub-kips > > > > >>>> (linked > > > > >>>>>>> in > > > > >>>>>>>>>>>> KIP-1150) > > > > >>>>>>>>>>>>>> that > > > > >>>>>>>>>>>>>>>>> offer > > > > >>>>>>>>>>>>>>>>>>> an > > > > >>>>>>>>>>>>>>>>>>>> implementation of this feature. > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> We kindly ask you to use the proper DISCUSS > > > > >>>>>>> threads > > > > >>>>>>>>> for > > > > >>>>>>>>>>>> each > > > > >>>>>>>>>>>>>>> type of > > > > >>>>>>>>>>>>>>>>>>>> concern and keep this one to discuss > > > > >>>> whether > > > > >>>>>>> Apache > > > > >>>>>>>>> Kafka > > > > >>>>>>>>>>>> wants > > > > >>>>>>>>>>>>>>> to > > > > >>>>>>>>>>>>>>>>> have > > > > >>>>>>>>>>>>>>>>>>>> this feature or not. > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> Thanks in advance on behalf of all the > > > > >>>> authors > > > > >>>>>> of > > > > >>>>>>>>> this KIP. > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> ------------------ > > > > >>>>>>>>>>>>>>>>>>>> Josep Prat > > > > >>>>>>>>>>>>>>>>>>>> Open Source Engineering Director, Aiven > > > > >>>>>>>>>>>>>>>>>>>> [email protected] | +491715557497 | > > > > >>>>>>> aiven.io > > > > >>>>>>>>>>>>>>>>>>>> Aiven Deutschland GmbH > > > > >>>>>>>>>>>>>>>>>>>> Alexanderufer 3-7, 10117 Berlin > > > > >>>>>>>>>>>>>>>>>>>> Geschäftsführer: Oskari Saarenmaa, Hannu > > > > >>>>>>> Valtonen, > > > > >>>>>>>>>>>>>>>>>>>> Anna Richardson, Kenneth Chen > > > > >>>>>>>>>>>>>>>>>>>> Amtsgericht Charlottenburg, HRB 209739 B > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>> > > > > >>>>>>> > > > > >>>>>> > > > > >>>>> > > > > >>>> > > > > >> > > > > >> > > > > > > > > > > > > > > -- Anatolii Popov Senior Software Developer, *Aiven OY* m: +358505126242 w: aiven.io e: [email protected] <https://www.facebook.com/aivencloud> <https://www.linkedin.com/company/aiven/> <https://twitter.com/aiven_io>