Hi Greg, Thanks for sharing the meeting notes. I agree we should keep polishing the contents of 1150 & the high-level design in 1163 to prepare for a vote.
Thanks. Luke On Fri, Nov 14, 2025 at 3:54 AM Greg Harris <[email protected]> wrote: > Hi all, > > There was a video call between myself, Ivan Yurchenko, Jun Rao, and Andrew > Schofield pertaining to KIP-1150. Here are the notes from that meeting: > > Ivan: What is the future state of Kafka in this area, in 5 years? > Jun: Do we want something more cloud native? Yes, started with Tiered > Storage. If there’s a better way, we should explore it. In the long term > this will be useful > Because Kafka is used so widely, we need to make sure everything we add is > for the long term and for everyone, not just for a single company. > When we add TS, it doesn’t just solve Uber’s use-case. We want something > that’s high quality/lasts/maintainable, and can work with all existing > capabilities. > If both 1150 and 1176 proceed at the same time, it’s confusing. They > overlap, but Diskless is more ambitious. > If both KIPs are being seriously worked on, then we don’t really need both, > because Diskless clearly is better. Having multiple will confuse people. It > will duplicate some of the effort. > If we want diskless ultimately, what is the short term strategy, to get > some early wins first? > Ivan: Andrew, do you want a more revolutionary approach? > Andrew: Eventually the architecture will change substantially, it may not > be necessary to put all of that bill onto Diskless at once. > Greg: We all agree on having a high quality feature merged upstream, and > supporting all APIs > Jun: We should try and keep things simple, but there is some minimum > complexity needed. > When doing the short term changes (1176), it doesn’t really progress in > changing to a more modern architecture. > Greg: Was TS+Compaction the only feature miss we’ve had so far? > Jun: The danger of only applying changes to some part of the API, you set > the precedence that you only have to implement part of the API. Supporting > the full API set should be a minimum requirement. > Andrew: When we started Kraft, how much did we know the design? > Jun: For Kraft we didn’t really know much about the migration, but the > high-level was clear. > Greg: Is 1150 votable in its current state? > Jun: 1150 should promise to support all APIs. It doesn’t have to have all > the details/apis/etc. KIP-500 didn’t have it. > We do need some high-level design enough to give confidence that the > promise is able to be fulfilled. > Greg: Is the draft version in 1163 enough detail or is more needed? > Jun: We need to agree on the core design, such as leaderless etc. And how > the APIs will be supported. > Greg: Okay we can include these things, and provide a sketch of how the > other leader-based features operate. > Jun: Yeah if at a high level the sketch appears to work, we can approve > that functionality. > Are you committed to doing the more involved and big project? > Greg: Yes, we’re committed to the 1163 design and can’t really accept 1176. > Jun: TS was slow because of Uber resourcing problems > Greg: We’ll push internally for resources, and use the community sentiment > to motivate Aiven. > How far into the future should we look? What sort of scale? > Jun: As long as there’s a path forward, and we’re not closing off future > improvements, we can figure out how to handle a larger scale when it > arises. > Greg: Random replica placement is very harmful, can we recommend users to > use an external tool like CruiseControl? 
> Jun: Not everyone uses CruiseControl, we would probably need some solution > for this out of the box > Ivan: Should the Batch Coordinator be pluggable? > Jun: Out-of-box experience should be good, good to allow other > implementations > Greg: But it could hurt Kafka feature/upgrade velocity when we wait for > plugin providers to implement it > Ivan: We imagined that maybe cloud hyperscalers could implement it with > e.g. dynamodb > Greg: Could we bake more details of the different providers into Kafka, or > does it still make sense for it to be pluggable? > Jun: Make it whatever is easiest to roll out and add new clients > Andrew: What happens next? Do you want to get KIP-1150 voted? > Ivan: The vote is already open, we’re not too pressed for time. We’ll go > improve the 1163 design and communication. > Is 1176 a competing design? Someone will ask. > Jun: If we are seriously working on something more ambitious, yeah we > shouldn’t do the stop-gap solution. > It’s diverting review resources. If we can get the short term thing in 1yr > but Diskless solution is 2y it makes sense to go for Diskless. If it’s 5yr, > that’s different and maybe the stop-gap solution is needed. > Greg: I’m biased but I believe we’re in the 1yr/2yr case. Should we > explicitly exclude 1176? > Andrew: Put your arms around the feature set you actually want, and use > that to rule out 1176. > Probably don’t need -1 votes, most likely KIPs just don’t receive votes. > Ivan: Should we have sync meetings like tiered storage did? > Jun: Satish posted meeting notes regularly, we should do the same. > > To summarize, we will be polishing the contents of 1150 & high level design > in 1163 to prepare for a vote. > We believe that the community should select the feature set of 1150 to > fully eliminate producer cross-zone costs, and make the investment in a > high quality Diskless Topics implementation rather than in stop-gap > solutions. > > Thanks, > Greg > > On Fri, Nov 7, 2025 at 9:19 PM Max fortun <[email protected]> wrote: > > > This may be a tangent, but we needed to offload storage off of Kafka into > > S3. We are keeping Kafka not as a source of truth, but as a mostly > > ephemeral broker that can come and go as it pleases. Be that scaling or > > outage. Disks can be destroyed and recreated at will, we still retain > data > > and use broker for just that, brokering messages. Not only that, we > reduced > > the requirement on the actual Kafka resources by reducing the size of a > > payload via a claim check pattern. Maybe this is an anti–pattern, but it > is > > super fast and highly cost efficient. We reworked ProducerRequest to > allow > > plugins. We added a custom http plugin that submits every request via a > > persisted connection to a microservice. Microservice stores the payload > and > > returns a tiny json metadata object,a claim check, that can be used to > find > > the actual data. Think of it as zipping the payload. This claim check > > metadata traverses the pipelines with consumers using the urls in > metadata > > to pull what they need. Think unzipping. This allowed us to also pull > ONLY > > the data that we need in graphql like manner. So if you have a 100K json > > payload and you need only a subsection, you can pull that by jmespath. > When > > you have multiple consumer groups yanking down huge payloads it is > > cumbersome on the broker. 
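[Editor's illustration] Max's setup plugs into ProduceRequest handling on a reworked client, which is not shown here; as a rough client-side approximation of the same claim-check idea, a standard Kafka ProducerInterceptor could offload the payload and forward only a small pointer. Everything below (the ClaimCheckInterceptor name, the claim.check.endpoint config key, the HTTP store service, and the JSON shape) is hypothetical, a minimal sketch rather than Max's actual plugin:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

/**
 * Hypothetical claim-check interceptor: the large payload is offloaded to an
 * external store and only a tiny JSON "claim check" travels through Kafka.
 */
public class ClaimCheckInterceptor implements ProducerInterceptor<String, byte[]> {
    private final HttpClient http = HttpClient.newHttpClient();
    private URI storeEndpoint;

    @Override
    public void configure(Map<String, ?> configs) {
        // e.g. claim.check.endpoint=http://payload-store.internal/store (hypothetical key)
        storeEndpoint = URI.create(String.valueOf(configs.get("claim.check.endpoint")));
    }

    @Override
    public ProducerRecord<String, byte[]> onSend(ProducerRecord<String, byte[]> record) {
        try {
            // Store the payload out of band (synchronously here, for simplicity);
            // the store replies with a URL where the payload can be retrieved.
            HttpRequest store = HttpRequest.newBuilder(storeEndpoint)
                    .POST(HttpRequest.BodyPublishers.ofByteArray(record.value()))
                    .build();
            String url = http.send(store, HttpResponse.BodyHandlers.ofString()).body();

            // Replace the payload with a tiny claim-check document; consumers
            // dereference the URL, optionally selecting only a subsection.
            byte[] claimCheck = ("{\"claimCheck\":\"" + url + "\"}")
                    .getBytes(StandardCharsets.UTF_8);
            return new ProducerRecord<>(record.topic(), record.partition(),
                    record.timestamp(), record.key(), claimCheck, record.headers());
        } catch (Exception e) {
            throw new RuntimeException("claim-check offload failed", e);
        }
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) { }

    @Override
    public void close() { }
}

Such a class would be wired in through the producer's interceptor.classes setting; the consumer side would symmetrically dereference the claim check, and could apply a JMESPath expression so that only the needed subsection of a large payload is pulled.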
When you have the same consumer groups yanking > > down a claim check, and then going out of band directly to the source of > > truth, the broker has some breathing room. Obviously our microservice > does > > not go directly to the cloud storage, as that would be too slow. It > stores > > the payload in high speed memory cache and returns asap. That memory is > > eventually persisted into S3. The retrieval goest against the cache > first, > > then against the S3. Overall a rather cheappy and zippy solution. I tried > > proposing the KIP for this, but there was no excitement. Check this out: > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=318606528 > > > > > > > On Nov 7, 2025, at 5:49 PM, Jun Rao <[email protected]> wrote: > > > > > > Hi, Andrew, > > > > > > If we want to focus only on reducing cross-zone replication costs, > there > > is > > > an alternative design in the KIP-1176 discussion thread that seems > > simpler > > > than the proposal here. I am copying the outline of that approach > below. > > > > > > 1. A new leader is elected. > > > 2. Leader maintains a first tiered offset, which is initialized to log > > end > > > offset. > > > 3. Leader writes produced data from the client to local log. > > > 4. Leader uploads produced data from all local logs as a combined > object > > > 5. Leader stores the metadata for the combined object in memory. > > > 6. If a follower fetch request has an offset >= first tiered offset, > the > > > metadata for the corresponding combined object is returned. Otherwise, > > the > > > local data is returned. > > > 7. Leader periodically advances first tiered offset. > > > > > > It's still a bit unnatural, but it could work. > > > > > > Hi, Ivan, > > > > > > Are you still committed to proceeding with the original design of > > KIP-1150? > > > > > > Thanks, > > > > > > Jun > > > > > > On Sun, Nov 2, 2025 at 6:00 AM Andrew Schofield < > > [email protected]> > > > wrote: > > > > > >> Hi, > > >> I’ve been following KIP-1150 and friends for a while. I’m going to > jump > > >> into the discussions too. > > >> > > >> Looking back at Jack Vanlightly’s message, I am not quite so convinced > > >> that it’s a kind of fork in the road. The primary aim of the effort is > > to > > >> reduce cross-zone replication costs so Apache Kafka is not > prohibitively > > >> expensive to use on cloud storage. I think it would be entirely viable > > to > > >> prioritise code reuse for an initial implementation of diskless > topics, > > and > > >> we could still have a more cloud-native design in the future. It’s > hard > > to > > >> predict what the community will prioritise in the future. > > >> > > >> Of the three major revisions, I’m in the rev3 camp. We can support > > >> leaderless produce requests, first writing WAL segments into object > > >> storage, and then using the regular partition leaders to sequence the > > >> records. The active log segment for a diskless topic will initially > > contain > > >> batch coordinates rather than record batches. The batch coordinates > can > > be > > >> resolved from WAL segments for consumers, and also in order to prepare > > log > > >> segments for uploading to tiered storage. Jun is probably correct that > > we > > >> need a more frequent object merging process than tiered storage > > provides. 
> > >> This is just the transition from write-optimised WAL segments to > > >> read-optimised tiered segments, and all of the object storage-based > > >> implementations of Kafka that I’m aware of do this rearrangement. But > > >> perhaps this more frequent object merging is a pre-GA improvement, > > rather > > >> than a strict requirement for an initial implementation for early > access > > >> use. > > >> > > >> For zone-aligned share consumers, the share group assignor is intended > > to > > >> be rack-aware. Consumers should be assigned to partitions with leaders > > in > > >> their zone. The simple assignor is not rack-aware, but it easily could > > be > > >> or we could have a rack-aware assignor. > > >> > > >> Thanks, > > >> Andrew > > >> > > >> > > >>> On 24 Oct 2025, at 23:14, Jun Rao <[email protected]> wrote: > > >>> > > >>> Hi, Ivan, > > >>> > > >>> Thanks for the reply. > > >>> > > >>> "As I understand, you’re speaking about locally materialized > segments. > > >> They > > >>> will indeed consume some IOPS. See them as a cache that could always > be > > >>> restored from the remote storage. While it’s not ideal, it's still OK > > to > > >>> lose data in them due to a machine crash, for example. Because of > this, > > >> we > > >>> can avoid explicit flushing on local materialized segments at all and > > let > > >>> the file system and page cache figure out when to flush optimally. > This > > >>> would not eliminate the extra IOPS, but should reduce it > dramatically, > > >>> depending on throughput for each partition. We, of course, continue > > >>> flushing the metadata segments as before." > > >>> > > >>> If we have a mix of classic and diskless topics on the same broker, > > it's > > >>> important that the classic topics' data is flushed to disk as quickly > > as > > >>> possible. To achieve this, users typically set dirty_expire_centisecs > > in > > >>> the kernel based on the number of available disk IOPS. Once you set > > this > > >>> number, it applies to all dirty files, including the cached data in > > >>> diskless topics. So, if there are more files actively accumulating > > data, > > >>> the flush frequency and therefore RPO is reduced for classic topics. > > >>> > > >>> "We should have mentioned this explicitly, but this step, in fact, > > >> remains > > >>> in the form of segments offloading to tiered storage. When we > assemble > > a > > >>> segment and hand it over to RemoteLogManager, we’re effectively doing > > >>> metadata compaction: replacing a big number of pieces of metadata > about > > >>> individual batches with a single record in __remote_log_metadata." > > >>> > > >>> The object merging in tier storage typically only kicks in after a > few > > >>> hours. The impact is (1) the amount of accumulated metadata is still > > >> quite > > >>> large; (2) there are many small objects, leading to poor read > > >> performance. > > >>> I think we need a more frequent object merging process than tier > > storage > > >>> provides. > > >>> > > >>> Jun > > >>> > > >>> > > >>> On Thu, Oct 23, 2025 at 10:12 AM Ivan Yurchenko <[email protected]> > > wrote: > > >>> > > >>>> Hello Jack, Jun, Luke, and all! > > >>>> > > >>>> Thank you for your messages. > > >>>> > > >>>> Let me first address some of Jun’s comments. > > >>>> > > >>>>> First, it degrades the durability. > > >>>>> For each partition, now there are two files being actively written > > at a > > >>>>> given point of time, one for the data and another for the metadata. 
> > >>>>> Flushing each file requires a separate IO. If the disk has 1K IOPS > > and > > >> we > > >>>>> have 5K partitions in a broker, currently we can afford to flush > each > > >>>>> partition every 5 seconds, achieving an RPO of 5 seconds. If we > > double > > >>>> the > > >>>>> number of files per partition, we can only flush each partition > every > > >> 10 > > >>>>> seconds, which makes RPO twice as bad. > > >>>> > > >>>> As I understand, you’re speaking about locally materialized > segments. > > >> They > > >>>> will indeed consume some IOPS. See them as a cache that could always > > be > > >>>> restored from the remote storage. While it’s not ideal, it's still > OK > > to > > >>>> lose data in them due to a machine crash, for example. Because of > > this, > > >> we > > >>>> can avoid explicit flushing on local materialized segments at all > and > > >> let > > >>>> the file system and page cache figure out when to flush optimally. > > This > > >>>> would not eliminate the extra IOPS, but should reduce it > dramatically, > > >>>> depending on throughput for each partition. We, of course, continue > > >>>> flushing the metadata segments as before. > > >>>> > > >>>> It’s worth making a note on caching. I think nobody will disagree > that > > >>>> doing direct reads from remote storage every time a batch is > requested > > >> by a > > >>>> consumer will not be practical neither from the performance nor from > > the > > >>>> economy point of view. We need a way to keep the number of GET > > requests > > >>>> down. There are multiple options, for example: > > >>>> 1. Rack-aware distributed in-memory caching. > > >>>> 2. Local in-memory caching. Comes with less network chattiness and > > >> works > > >>>> well if we have more or less stable brokers to consume from. > > >>>> 3. Materialization of diskless logs on local disk. Way lower impact > on > > >>>> RAM and also requires stable brokers for consumption (using just > > >> assigned > > >>>> replicas will probably work well). > > >>>> > > >>>> Materialization is one of possible options, but we can choose > another > > >> one. > > >>>> However, we will have this dilemma regardless of whether we have an > > >>>> explicit coordinator or we go “coordinator-less”. > > >>>> > > >>>>> Second, if we ever need this > > >>>>> metadata somewhere else, say in the WAL file manager, the consumer > > >> needs > > >>>> to > > >>>>> subscribe to every partition in the cluster, which is inefficient. > > The > > >>>>> actual benefit of this approach is also questionable. On the > surface, > > >> it > > >>>>> might seem that we could reduce the number of lines that need to be > > >>>> changed > > >>>>> for this KIP. However, the changes are quite intrusive to the > classic > > >>>>> partition's code path and will probably make the code base harder > to > > >>>>> maintain in the long run. I like the original approach based on the > > >> batch > > >>>>> coordinator much better than this one. We could probably refactor > the > > >>>>> producer state code so that it could be reused in the batch > > >> coordinator. > > >>>> > > >>>> It’s hard to disagree with this. The explicit coordinator is more a > > side > > >>>> thing, while coordinator-less approach is more about extending > > >>>> ReplicaManager, UnifiedLog and others substantially. > > >>>> > > >>>>> Thanks for addressing the concerns on the number of RPCs in the > > produce > > >>>>> path. I agree that with the metadata crafting mechanism, we could > > >>>> mitigate > > >>>>> the PRC problem. 
However, since we now require the metadata to be > > >>>>> collocated with the data on the same set of brokers, it's weird > that > > >> they > > >>>>> are now managed by different mechanisms. The data assignment now > uses > > >> the > > >>>>> metadata crafting mechanism, but the metadata is stored in the > > classic > > >>>>> partition using its own assignment strategy. It will be complicated > > to > > >>>> keep > > >>>>> them collocated. > > >>>> > > >>>> I would like to note that the metadata crafting is needed only to > tell > > >>>> producers which brokers they should send Produce requests to, but > data > > >> (as > > >>>> in “locally materialized log”) is located on partition replicas, > i.e. > > >>>> automatically co-located with metadata. > > >>>> > > >>>> As a side note, it would probably be better that instead of > implicitly > > >>>> crafting partition metadata, we extend the metadata protocol so that > > for > > >>>> diskless partitions we return not only the leader and replicas, but > > also > > >>>> some “recommended produce brokers”, selected for optimal performance > > and > > >>>> costs. Producers will pick ones in their racks. > > >>>> > > >>>>> I am also concerned about the removal of the object > > compaction/merging > > >>>>> step. > > >>>> > > >>>> We should have mentioned this explicitly, but this step, in fact, > > >> remains > > >>>> in the form of segments offloading to tiered storage. When we > > assemble a > > >>>> segment and hand it over to RemoteLogManager, we’re effectively > doing > > >>>> metadata compaction: replacing a big number of pieces of metadata > > about > > >>>> individual batches with a single record in __remote_log_metadata. > > >>>> > > >>>> We could create a Diskless-specific merging mechanism instead if > > needed. > > >>>> It’s rather easy with the explicit coordinator approach. With the > > >>>> coordinator-less approach, this would probably be a bit more tricky > > >>>> (rewriting the tail of the log by the leader + replicating this > change > > >>>> reliably). > > >>>> > > >>>>> I see a tendency toward primarily optimizing for the fewest code > > >> changes > > >>>> in > > >>>>> the KIP. Instead, our primary goal should be a clean design that > can > > >> last > > >>>>> for the long term. > > >>>> > > >>>> Yes, totally agree. > > >>>> > > >>>> > > >>>> > > >>>> Luke, > > >>>>> I'm wondering if the complexity of designing txn and queue is > because > > >> of > > >>>>> leaderless cluster, do you think it will be simpler if we only > focus > > on > > >>>> the > > >>>>> "diskless" design to handle object compaction/merging to/from the > > >> remote > > >>>>> storage to save the cross-AZ cost? > > >>>> > > >>>> After some evolution of the original proposal, leaderless is now > > >> limited. > > >>>> We only need to be able to accept Produce requests on more than one > > >> broker > > >>>> to eliminate the cross-AZ costs for producers. Do I get it right > that > > >> you > > >>>> propose to get rid of this? Or do I misunderstand? > > >>>> > > >>>> > > >>>> > > >>>> Let’s now look at this problem from a higher level, as Jack > proposed. > > As > > >>>> it was said, the big choice we need to make is whether we 1) create > an > > >>>> explicit batch coordinator; or 2) go for the coordinator-less > > approach, > > >>>> where each diskless partition is managed by its leader as in classic > > >> topics. > > >>>> > > >>>> If we try to compare the two approaches: > > >>>> > > >>>> Pluggability: > > >>>> - Explicit coordinator: Possible. 
For example, some setups may > benefit > > >>>> from batch metadata being stored in a cloud database (such as AWS > > >> DynamoDB > > >>>> or GCP Spanner). > > >>>> - Coordinator-less: Impossible. > > >>>> > > >>>> Scalability and fault tolerance: > > >>>> - Explicit coordinator: Depends on the implementation and it’s also > > >>>> necessary to actively work for it. > > >>>> - Coordinator-less: Closer to classic Kafka topics. Scaling is done > by > > >>>> partition placement, partitions could fail independently. > > >>>> > > >>>> Separation of concerns: > > >>>> - Explicit coordinator: Very good. Diskless remains more independent > > >> from > > >>>> classic topics in terms of code and workflows. For example, the > > >>>> above-mentioned non-tiered storage metadata compaction mechanism > could > > >> be > > >>>> relatively simply implemented with it. As a flip side of this, some > > >>>> workflows (e.g. transactions) will have to be adapted. > > >>>> - Coordinator-less: Less so. It leans to the opposite: bringing > > diskless > > >>>> closer to classic topics. Some code paths and workflows could be > more > > >>>> straightforwardly reused, but they will inevitably have to be > adapted > > to > > >>>> accommodate both topic types as also discussed. > > >>>> > > >>>> Cloud-nativeness. This is a vague concept, also related to the > > previous, > > >>>> but let’s try: > > >>>> - Explicit coordinator: Storing and processing metadata separately > > makes > > >>>> it easier for brokers to take different roles, be purely stateless > if > > >>>> needed, etc. > > >>>> - Coordinator-less: Less so. Something could be achieved with > creative > > >>>> partition placement, but not much. > > >>>> > > >>>> Both seem to have their pros and cons. However, answering Jack’s > > >> question, > > >>>> the explicit coordinator approach may indeed lead to a more flexible > > >> design. > > >>>> > > >>>> > > >>>> The purpose of this deviation in the discussion was to receive a > > >>>> preliminary community evaluation of the coordinator-less approach > > >> without > > >>>> taking on the task of writing a separate KIP and fitting it in the > > >> system > > >>>> of KIP-1150 and its children. We’re open to stopping it and getting > > >> back to > > >>>> working out the coordinator design if the community doesn’t favor > the > > >>>> proposed approach. > > >>>> > > >>>> Best, > > >>>> Ivan and Diskless team > > >>>> > > >>>> On Mon, Oct 20, 2025, at 05:58, Luke Chen wrote: > > >>>>> Hi Ivan, > > >>>>> > > >>>>> As Jun pointed out, the updated design seems to have some > > shortcomings > > >>>>> although it simplifies the implementation. > > >>>>> > > >>>>> I'm wondering if the complexity of designing txn and queue is > because > > >> of > > >>>>> leaderless cluster, do you think it will be simpler if we only > focus > > on > > >>>> the > > >>>>> "diskless" design to handle object compaction/merging to/from the > > >> remote > > >>>>> storage to save the cross-AZ cost? > > >>>>> > > >>>>> > > >>>>> Thank you, > > >>>>> Luke > > >>>>> > > >>>>> On Sat, Oct 18, 2025 at 5:22 AM Jun Rao <[email protected]> > > >>>> wrote: > > >>>>> > > >>>>>> Hi, Ivan, > > >>>>>> > > >>>>>> Thanks for the explanation. > > >>>>>> > > >>>>>> "we write the reference to the WAL file with the batch data" > > >>>>>> > > >>>>>> I understand the approach now, but I think it is a hacky one. > There > > >> are > > >>>>>> multiple short comings with this design. First, it degrades the > > >>>> durability. 
> > >>>>>> For each partition, now there are two files being actively written > > at > > >> a > > >>>>>> given point of time, one for the data and another for the > metadata. > > >>>>>> Flushing each file requires a separate IO. If the disk has 1K IOPS > > and > > >>>> we > > >>>>>> have 5K partitions in a broker, currently we can afford to flush > > each > > >>>>>> partition every 5 seconds, achieving an RPO of 5 seconds. If we > > double > > >>>> the > > >>>>>> number of files per partition, we can only flush each partition > > every > > >>>> 10 > > >>>>>> seconds, which makes RPO twice as bad. Second, if we ever need > this > > >>>>>> metadata somewhere else, say in the WAL file manager, the consumer > > >>>> needs to > > >>>>>> subscribe to every partition in the cluster, which is inefficient. > > The > > >>>>>> actual benefit of this approach is also questionable. On the > > surface, > > >>>> it > > >>>>>> might seem that we could reduce the number of lines that need to > be > > >>>> changed > > >>>>>> for this KIP. However, the changes are quite intrusive to the > > classic > > >>>>>> partition's code path and will probably make the code base harder > to > > >>>>>> maintain in the long run. I like the original approach based on > the > > >>>> batch > > >>>>>> coordinator much better than this one. We could probably refactor > > the > > >>>>>> producer state code so that it could be reused in the batch > > >>>> coordinator. > > >>>>>> > > >>>>>> Thanks for addressing the concerns on the number of RPCs in the > > >> produce > > >>>>>> path. I agree that with the metadata crafting mechanism, we could > > >>>> mitigate > > >>>>>> the PRC problem. However, since we now require the metadata to be > > >>>>>> collocated with the data on the same set of brokers, it's weird > that > > >>>> they > > >>>>>> are now managed by different mechanisms. The data assignment now > > uses > > >>>> the > > >>>>>> metadata crafting mechanism, but the metadata is stored in the > > classic > > >>>>>> partition using its own assignment strategy. It will be > complicated > > to > > >>>> keep > > >>>>>> them collocated. > > >>>>>> > > >>>>>> I am also concerned about the removal of the object > > compaction/merging > > >>>>>> step. My first concern is on the amount of metadata that need to > be > > >>>> kept. > > >>>>>> Without object compcation, the metadata generated in the produce > > path > > >>>> can > > >>>>>> only be deleted after remote tiering kicks in. Let's say for every > > >>>> 250ms we > > >>>>>> produce 100 byte of metadata per partition. Let's say remoting > > tiering > > >>>>>> kicks in after 5 hours. In a cluster with 100K partitions, we need > > to > > >>>> keep > > >>>>>> about 100 * (1 / 0.2) * 5 * 3600 * 100K = 720 GB metadata, quite > > >>>>>> signficant. A second concern is on performance. Every time we need > > to > > >>>>>> rebuild the caching data, we need to read a bunch of small objects > > >>>> from S3, > > >>>>>> slowing down the building process. If a consumer happens to need > > such > > >>>> data, > > >>>>>> it could slow down the application. > > >>>>>> > > >>>>>> I see a tendency toward primarily optimizing for the fewest code > > >>>> changes in > > >>>>>> the KIP. Instead, our primary goal should be a clean design that > can > > >>>> last > > >>>>>> for the long term. 
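[Editor's illustration] For concreteness, the two estimates in Jun's message above work out as follows, assuming one ~100-byte metadata entry per partition every 250 ms (i.e. a 1/0.25 factor; the 720 GB figure corresponds to that rather than to the 1/0.2 written above):

  (5,000 partitions x 2 files) / 1,000 IOPS  = 10 s between flushes per partition, i.e. RPO roughly doubles from 5 s to 10 s
  100 bytes x (1 / 0.25 s) x (5 x 3,600 s) x 100,000 partitions = 7.2 x 10^11 bytes ≈ 720 GB of accumulated batch metadata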
> > >>>>>> > > >>>>>> Thanks, > > >>>>>> > > >>>>>> Jun > > >>>>>> > > >>>>>> On Tue, Oct 14, 2025 at 11:02 AM Ivan Yurchenko <[email protected]> > > >>>> wrote: > > >>>>>> > > >>>>>>> Hi Jun, > > >>>>>>> > > >>>>>>> Thank you for your message. I’m sorry that I failed to clearly > > >>>> explain > > >>>>>> the > > >>>>>>> idea. Let me try to fix this. > > >>>>>>> > > >>>>>>>> Does each partition now have a metadata partition and a separate > > >>>> data > > >>>>>>>> partition? If so, I am concerned that it essentially doubles the > > >>>> number > > >>>>>>> of > > >>>>>>>> partitions, which impacts the number of open file descriptors > and > > >>>> the > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a > > separate > > >>>>>>>> partition just to store the metadata. It's as if we are creating > > an > > >>>>>>>> internal topic with an unbounded number of partitions. > > >>>>>>> > > >>>>>>> No. There will be only one physical partition per diskless > > >>>> partition. Let > > >>>>>>> me explain this with an example. Let’s say we have a diskless > > >>>> partition > > >>>>>>> topic-0. It has three replicas 0, 1, 2; 0 is the leader. We > produce > > >>>> some > > >>>>>>> batches to this partition. The content of the segment file will > be > > >>>>>>> something like this (for each batch): > > >>>>>>> > > >>>>>>> BaseOffset: 00000000000000000000 (like in classic) > > >>>>>>> Length: 123456 (like in classic) > > >>>>>>> PartitionLeaderEpoch: like in classic > > >>>>>>> Magic: like in classic > > >>>>>>> CRC: like in classic > > >>>>>>> Attributes: like in classic > > >>>>>>> LastOffsetDelta: like in classic > > >>>>>>> BaseTimestamp: like in classic > > >>>>>>> MaxTimestamp: like in classic > > >>>>>>> ProducerId: like in classic > > >>>>>>> ProducerEpoch: like in classic > > >>>>>>> BaseSequence: like in classic > > >>>>>>> RecordsCount: like in classic > > >>>>>>> Records: > > >>>>>>> path/to/wal/files/5b55c4bb-f52a-4204-aea6-81226895158a; byte > offset > > >>>>>>> 123456 > > >>>>>>> > > >>>>>>> It looks very much like classic log entries. The only difference > is > > >>>> that > > >>>>>>> instead of writing real Records, we write the reference to the > WAL > > >>>> file > > >>>>>>> with the batch data (I guess we need only the name and the byte > > >>>> offset, > > >>>>>>> because the byte length is the standard field above). Otherwise, > > >>>> it’s a > > >>>>>>> normal Kafka log with the leader and replicas. > > >>>>>>> > > >>>>>>> So we have as many partitions for diskless as for classic. As of > > open > > >>>>>> file > > >>>>>>> descriptors, let’s proceed to the following: > > >>>>>>> > > >>>>>>>> Are the metadata and > > >>>>>>>> the data for the same partition always collocated on the same > > >>>> broker? > > >>>>>> If > > >>>>>>>> so, how do we enforce that when replicas are reassigned? > > >>>>>>> > > >>>>>>> The source of truth for the data is still in WAL files on object > > >>>> storage. > > >>>>>>> The source of truth for the metadata is in segment files on the > > >>>> brokers > > >>>>>> in > > >>>>>>> the replica set. Two new mechanisms are planned, both independent > > of > > >>>> this > > >>>>>>> new proposal, but I want to present them to give the idea that > > only a > > >>>>>>> limited amount of data files will be operated locally: > > >>>>>>> - We want to assemble batches into segment files and offload them > > to > > >>>>>>> tiered storage in order to prevent the unbounded growth of batch > > >>>>>> metadata. 
> > >>>>>>> For this, we need to open only a few file descriptors (for the > > >>>> segment > > >>>>>>> file itself + the necessary indexes) before the segment is fully > > >>>> written > > >>>>>>> and handed over to RemoteLogManager. > > >>>>>>> - We want to assemble local segment files for caching purposes as > > >>>> well, > > >>>>>>> i.e. to speed up fetching. This will not materialize the full > > >>>> content of > > >>>>>>> the log, but only the hot set according to some policy (or > > >>>> configurable > > >>>>>>> policies), i.e. the number of segments and file descriptors will > > >>>> also be > > >>>>>>> limited. > > >>>>>>> > > >>>>>>>> The number of RPCs in the produce path is significantly higher. > > For > > >>>>>>>> example, if a produce request has 100 partitions, in a cluster > > >>>> with 100 > > >>>>>>>> brokers, each produce request could generate 100 more RPC > > requests. > > >>>>>> This > > >>>>>>>> will significantly increase the request rate. > > >>>>>>> > > >>>>>>> This is a valid concern that we considered, but this issue can be > > >>>>>>> mitigated. I’ll try to explain the approach. > > >>>>>>> The situation with a single broker is trivial: all the commit > > >>>> requests go > > >>>>>>> from the broker to itself. > > >>>>>>> Let’s scale this to a multi-broker cluster, but located in the > > single > > >>>>>> rack > > >>>>>>> (AZ). Any broker can accept Produce requests for diskless > > >>>> partitions, but > > >>>>>>> we can tell producers (through metadata) to always send Produce > > >>>> requests > > >>>>>> to > > >>>>>>> leaders. For example, broker 0 hosts the leader replicas for > > diskless > > >>>>>>> partitions t1-0, t2-1, t3-0. It will receive diskless Produce > > >>>> requests > > >>>>>> for > > >>>>>>> these partitions in various combinations, but only for them. > > >>>>>>> > > >>>>>>> Broker 0 > > >>>>>>> +-----------------+ > > >>>>>>> | t1-0 | > > >>>>>>> | t2-1 <--------------------+ > > >>>>>>> | t3-0 | | > > >>>>>>> produce | +-------------+ | | > > >>>>>>> requests | | diskless | | | > > >>>>>>> --------------->| produce +--------------+ > > >>>>>>> for these | | WAL buffer | | commit requests > > >>>>>>> partitions | +-------------+ | for these partitions > > >>>>>>> | | > > >>>>>>> +-----------------+ > > >>>>>>> > > >>>>>>> The same applies for other brokers in this cluster. Effectively, > > each > > >>>>>>> broker will commit only to itself, which effectively means 1 > commit > > >>>>>> request > > >>>>>>> per WAL buffer (this may be 0 physical network calls, if we wish, > > >>>> just a > > >>>>>>> local function call). > > >>>>>>> > > >>>>>>> Now let’s scale this to multiple racks (AZs). Obviously, we > cannot > > >>>> always > > >>>>>>> send Produce requests to the designated leaders of diskless > > >>>> partitions: > > >>>>>>> this would mean inter-AZ network traffic, which we would like to > > >>>> avoid. > > >>>>>> To > > >>>>>>> avoid it, we say that every broker has a “diskless produce > > >>>>>> representative” > > >>>>>>> in every AZ. If we continue our example: when a Produce request > for > > >>>> t1-0, > > >>>>>>> t2-1, or t3-0 comes from a producer in AZ 0, it lands on broker 0 > > >>>> (in the > > >>>>>>> broker’s AZ the representative is the broker itself). However, if > > it > > >>>>>> comes > > >>>>>>> from AZ 1, it lands on broker 1; in AZ 2, it’s broker 2. 
> > >>>>>>> > > >>>>>>> |produce requests |produce requests |produce > > >>>> requests > > >>>>>>> |for t1-0, t2-1, t3-0 |for t1-0, t2-1, t3-0 |for t1-0, > t2-1, > > >>>>>> t3-0 > > >>>>>>> |from AZ 0 |from AZ 1 |from AZ 2 > > >>>>>>> v v v > > >>>>>>> Broker 0 (AZ 0) Broker 1 (AZ 1) Broker 2 (AZ 2) > > >>>>>>> +---------------+ +---------------+ +---------------+ > > >>>>>>> | t1-0 | | | | | > > >>>>>>> | t2-1 | | | | | > > >>>>>>> | t3-0 | | | | | > > >>>>>>> +---------------+ +--------+------+ +--------+------+ > > >>>>>>> ^ ^ | | > > >>>>>>> | +--------------------+ | > > >>>>>>> | commit requests for these partitions | > > >>>>>>> | | > > >>>>>>> +-------------------------------------------------+ > > >>>>>>> commit requests for these partitions > > >>>>>>> > > >>>>>>> All the partitions that broker 0 is the leader of will be > > >>>> “represented” > > >>>>>> by > > >>>>>>> brokers 1 and 2 in their AZs. > > >>>>>>> > > >>>>>>> Of course, this relationship goes both ways between AZs (not > > >>>> necessarily > > >>>>>>> between the same brokers). It means that provided the cluster is > > >>>> balanced > > >>>>>>> by the number of brokers per AZ, each broker will represent > > >>>>>> (number_of_azs > > >>>>>>> - 1) other brokers. This will result in the situation that for > the > > >>>>>> majority > > >>>>>>> of commits, each broker will do up to (number_of_azs - 1) network > > >>>> commit > > >>>>>>> requests (plus one local). Cloud regions tend to have 3 AZs, very > > >>>> rarely > > >>>>>>> more. That means, brokers will be doing up to 2 network commit > > >>>> requests > > >>>>>> per > > >>>>>>> WAL file. > > >>>>>>> > > >>>>>>> There are the following exceptions: > > >>>>>>> 1. Broker count imbalance between AZs. For example, when we have > 2 > > >>>> AZs > > >>>>>> and > > >>>>>>> one has three brokers and another AZ has one. This one broker > will > > do > > >>>>>>> between 1 and 3 commit requests per WAL file. This is not an > > extreme > > >>>>>>> amplification. Such an imbalance is not healthy in most practical > > >>>> setups > > >>>>>>> and should be avoided anyway. > > >>>>>>> 2. Leadership changes and metadata propagation period. When the > > >>>> partition > > >>>>>>> t3-0 is relocated from broker 0 to some broker 3, the producers > > will > > >>>> not > > >>>>>>> know this immediately (unless we want to be strict and respond > with > > >>>>>>> NOT_LEADER_OR_FOLLOWER). So if t1-0, t2-1, and t3-0 will come > > >>>> together > > >>>>>> in a > > >>>>>>> WAL buffer on broker 2, it will have to send two commit requests: > > to > > >>>>>> broker > > >>>>>>> 0 to commit t1-0 and t2-1, and to broker 3 to commit t3-0. This > > >>>> situation > > >>>>>>> is not permanent and as producers update the cluster metadata, it > > >>>> will be > > >>>>>>> resolved. > > >>>>>>> > > >>>>>>> This all could be built with the metadata crafting mechanism only > > >>>> (which > > >>>>>>> is anyway needed for Diskless in one way or another to direct > > >>>> producers > > >>>>>> and > > >>>>>>> consumers where we need to avoid inter-AZ traffic), just with the > > >>>> right > > >>>>>>> policy for it (for example, some deterministic hash-based > formula). > > >>>> I.e. > > >>>>>> no > > >>>>>>> explicit support for “produce representative” or anything like > this > > >>>> is > > >>>>>>> needed on the cluster level, in KRaft, etc. > > >>>>>>> > > >>>>>>>> The same WAL file metadata is now duplicated into two places, > > >>>> partition > > >>>>>>>> leader and WAL File Manager. 
Which one is the source of truth, > and > > >>>> how > > >>>>>> do > > >>>>>>>> we maintain consistency between the two places? > > >>>>>>> > > >>>>>>> We do only two operations on WAL files that span multiple > diskless > > >>>>>>> partitions: committing and deleting. Commits can be done > > >>>> independently as > > >>>>>>> described above. But deletes are different, because when a file > is > > >>>>>> deleted, > > >>>>>>> this affects all the partitions that still have alive batches in > > this > > >>>>>> file > > >>>>>>> (if any). > > >>>>>>> > > >>>>>>> The WAL file manager is a necessary point of coordination to > delete > > >>>> WAL > > >>>>>>> files safely. We can say it is the source of truth about files > > >>>>>> themselves, > > >>>>>>> while the partition leaders and their logs hold the truth about > > >>>> whether a > > >>>>>>> particular file contains live batches of this particular > partition. > > >>>>>>> > > >>>>>>> The file manager will do this important task: be able to say for > > sure > > >>>>>> that > > >>>>>>> a file does not contain any live batch of any existing partition. > > For > > >>>>>> this, > > >>>>>>> it will have to periodically check against the partition leaders. > > >>>>>>> Considering that batch deletion is irreversible, when we declare > a > > >>>> file > > >>>>>>> “empty”, this is guaranteed to be and stay so. > > >>>>>>> > > >>>>>>> The file manager has to know about files being committed to start > > >>>> track > > >>>>>>> them and periodically check if they are empty. We can consider > > >>>> various > > >>>>>> ways > > >>>>>>> to achieve this: > > >>>>>>> 1. As was proposed in my previous message: best effort commit by > > >>>> brokers > > >>>>>> + > > >>>>>>> periodic prefix scans of object storage to detect files that went > > >>>> below > > >>>>>> the > > >>>>>>> radar due to network issue or the file manager temporary > > >>>> unavailability. > > >>>>>>> We’re speaking about listing the file names only and opening only > > >>>>>>> previously unknown files in order to find the partitions involved > > >>>> with > > >>>>>> them. > > >>>>>>> 2. Only do scans without explicit commit, i.e. fill the list of > > files > > >>>>>>> fully asynchronously and in the background. This may be not ideal > > >>>> due to > > >>>>>>> costs and performance of scanning tons of files. However, the > > number > > >>>> of > > >>>>>>> live WAL files should be limited due to tiered storage > offloading + > > >>>> we > > >>>>>> can > > >>>>>>> optimize this if we give files some global soft order in their > > names. > > >>>>>>> > > >>>>>>>> I am not sure how this design simplifies the implementation. The > > >>>>>> existing > > >>>>>>>> producer/replication code can't be simply reused. Adjusting both > > >>>> the > > >>>>>>> write > > >>>>>>>> path in the leader and the replication path in the follower to > > >>>>>> understand > > >>>>>>>> batch-header only data is quite intrusive to the existing logic. > > >>>>>>> > > >>>>>>> It is true that we’ll have to change LocalLog and UnifiedLog in > > >>>> order to > > >>>>>>> support these changes. However, it seems that idempotence, > > >>>> transactions, > > >>>>>>> queues, tiered storage will have to be changed less than with the > > >>>>>> original > > >>>>>>> design. 
This is because the partition leader state would remain > in > > >>>> the > > >>>>>> same > > >>>>>>> place (on brokers) and existing workflows that involve it would > > have > > >>>> to > > >>>>>> be > > >>>>>>> changed less compared to the situation where we globalize the > > >>>> partition > > >>>>>>> leader state in the batch coordinator. I admit this is hard to > make > > >>>>>>> convincing without both real implementations to hand :) > > >>>>>>> > > >>>>>>>> I am also > > >>>>>>>> not sure how this enables seamless switching the topic modes > > >>>> between > > >>>>>>>> diskless and classic. Could you provide more details on those? > > >>>>>>> > > >>>>>>> Let’s consider the scenario of turning a classic topic into > > >>>> diskless. The > > >>>>>>> user sets diskless.enabled=true, the leader receives this > metadata > > >>>> update > > >>>>>>> and does the following: > > >>>>>>> 1. Stop accepting normal append writes. > > >>>>>>> 2. Close the current active segment. > > >>>>>>> 3. Start a new segment that will be written in the diskless > format > > >>>> (i.e. > > >>>>>>> without data). > > >>>>>>> 4. Start accepting diskless commits. > > >>>>>>> > > >>>>>>> Since it’s the same log, the followers will know about that > switch > > >>>>>>> consistently. They will finish replicating the classic segments > and > > >>>> start > > >>>>>>> replicating the diskless ones. They will always know where each > > >>>> batch is > > >>>>>>> located (either inside a classic segment or referenced by a > > diskless > > >>>>>> one). > > >>>>>>> Switching back should be similar. > > >>>>>>> > > >>>>>>> Doing this with the coordinator is possible, but has some > caveats. > > >>>> The > > >>>>>>> leader must do the following: > > >>>>>>> 1. Stop accepting normal append writes. > > >>>>>>> 2. Close the current active segment. > > >>>>>>> 3. Write a special control segment to persist and replicate the > > fact > > >>>> that > > >>>>>>> from offset N the partition is now in the diskless mode. > > >>>>>>> 4. Inform the coordinator about the first offset N of the > “diskless > > >>>> era”. > > >>>>>>> 5. Inform the controller quorum that the transition has finished > > and > > >>>> that > > >>>>>>> brokers now can process diskless writes for this partition. > > >>>>>>> This could fail at some points, so this will probably require > some > > >>>>>>> explicit state machine with replication either in the partition > log > > >>>> or in > > >>>>>>> KRaft. > > >>>>>>> > > >>>>>>> It seems that the coordinator-less approach makes this simpler > > >>>> because > > >>>>>> the > > >>>>>>> “coordinator” for the partition and the partition leader are the > > >>>> same and > > >>>>>>> they store the partition metadata in the same log, too. While in > > the > > >>>>>>> coordinator approach we have to perform some kind of a > distributed > > >>>> commit > > >>>>>>> to handover metadata management from the classic partition leader > > to > > >>>> the > > >>>>>>> batch coordinator. > > >>>>>>> > > >>>>>>> I hope these explanations help to clarify the idea. Please let me > > >>>> know if > > >>>>>>> I should go deeper anywhere. > > >>>>>>> > > >>>>>>> Best, > > >>>>>>> Ivan and the Diskless team > > >>>>>>> > > >>>>>>> On Tue, Oct 7, 2025, at 01:44, Jun Rao wrote: > > >>>>>>>> Hi, Ivan, > > >>>>>>>> > > >>>>>>>> Thanks for the update. > > >>>>>>>> > > >>>>>>>> I am not sure that I fully understand the new design, but it > seems > > >>>> less > > >>>>>>>> clean than before. 
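[Editor's illustration] Sketching the leader-side classic-to-diskless switch Ivan outlines a few paragraphs above. This is a minimal, hypothetical illustration of the coordinator-less variant; PartitionLog and its methods are invented names for the sketch, not existing Kafka APIs:

import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of the leader-side classic -> diskless switch described above. */
final class DisklessModeSwitch {

    /** Invented abstraction over the partition log; not an existing Kafka interface. */
    interface PartitionLog {
        void rejectClassicAppends();   // 1. stop accepting normal append writes
        void closeActiveSegment();     // 2. close the current active segment
        void startDisklessSegment();   // 3. new active segment holds batch coordinates, not data
    }

    private final AtomicBoolean diskless = new AtomicBoolean(false);

    void onDisklessEnabled(PartitionLog log) {
        // Triggered when the leader observes diskless.enabled=true in a metadata update.
        if (diskless.compareAndSet(false, true)) {
            log.rejectClassicAppends();
            log.closeActiveSegment();
            log.startDisklessSegment();
            // 4. From here on, diskless commits are accepted. Followers learn about the
            //    switch by replicating the same log, so the transition stays consistent.
        }
    }
}

Switching back (diskless -> classic) would mirror these steps, as noted above; the coordinator-based variant additionally needs the control segment, the handover of offset N to the coordinator, and the KRaft notification, which is why it likely requires an explicit replicated state machine.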
> > >>>>>>>> > > >>>>>>>> Does each partition now have a metadata partition and a separate > > >>>> data > > >>>>>>>> partition? If so, I am concerned that it essentially doubles the > > >>>> number > > >>>>>>> of > > >>>>>>>> partitions, which impacts the number of open file descriptors > and > > >>>> the > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a > > separate > > >>>>>>>> partition just to store the metadata. It's as if we are creating > > an > > >>>>>>>> internal topic with an unbounded number of partitions. Are the > > >>>> metadata > > >>>>>>> and > > >>>>>>>> the data for the same partition always collocated on the same > > >>>> broker? > > >>>>>> If > > >>>>>>>> so, how do we enforce that when replicas are reassigned? > > >>>>>>>> > > >>>>>>>> The number of RPCs in the produce path is significantly higher. > > For > > >>>>>>>> example, if a produce request has 100 partitions, in a cluster > > >>>> with 100 > > >>>>>>>> brokers, each produce request could generate 100 more RPC > > requests. > > >>>>>> This > > >>>>>>>> will significantly increase the request rate. > > >>>>>>>> > > >>>>>>>> The same WAL file metadata is now duplicated into two places, > > >>>> partition > > >>>>>>>> leader and WAL File Manager. Which one is the source of truth, > and > > >>>> how > > >>>>>> do > > >>>>>>>> we maintain consistency between the two places? > > >>>>>>>> > > >>>>>>>> I am not sure how this design simplifies the implementation. The > > >>>>>> existing > > >>>>>>>> producer/replication code can't be simply reused. Adjusting both > > >>>> the > > >>>>>>> write > > >>>>>>>> path in the leader and the replication path in the follower to > > >>>>>> understand > > >>>>>>>> batch-header only data is quite intrusive to the existing > logic. I > > >>>> am > > >>>>>>> also > > >>>>>>>> not sure how this enables seamless switching the topic modes > > >>>> between > > >>>>>>>> diskless and classic. Could you provide more details on those? > > >>>>>>>> > > >>>>>>>> Jun > > >>>>>>>> > > >>>>>>>> On Thu, Oct 2, 2025 at 5:08 AM Ivan Yurchenko <[email protected]> > > >>>> wrote: > > >>>>>>>> > > >>>>>>>>> Hi dear Kafka community, > > >>>>>>>>> > > >>>>>>>>> In the initial Diskless proposal, we proposed to have a > separate > > >>>>>>>>> component, batch/diskless coordinator, whose role would be to > > >>>>>> centrally > > >>>>>>>>> manage the batch and WAL file metadata for diskless topics. > This > > >>>>>>> component > > >>>>>>>>> drew many reasonable comments from the community about how it > > >>>> would > > >>>>>>> support > > >>>>>>>>> various Kafka features (transactions, queues) and its > > >>>> scalability. > > >>>>>>> While we > > >>>>>>>>> believe we have good answers to all the expressed concerns, we > > >>>> took a > > >>>>>>> step > > >>>>>>>>> back and looked at the problem from a different perspective. > > >>>>>>>>> > > >>>>>>>>> We would like to propose an alternative Diskless design > *without > > >>>> a > > >>>>>>>>> centralized coordinator*. We believe this approach has > potential > > >>>> and > > >>>>>>>>> propose to discuss it as it may be more appealing to the > > >>>> community. > > >>>>>>>>> > > >>>>>>>>> Let us explain the idea. Most of the complications with the > > >>>> original > > >>>>>>>>> Diskless approach come from one necessary architecture change: > > >>>>>>> globalizing > > >>>>>>>>> the local state of partition leader in the batch coordinator. 
> > >>>> This > > >>>>>>> causes > > >>>>>>>>> deviations to the established workflows in various features > like > > >>>>>>> produce > > >>>>>>>>> idempotence and transactions, queues, retention, etc. These > > >>>>>> deviations > > >>>>>>> need > > >>>>>>>>> to be carefully considered, designed, and later implemented and > > >>>>>>> tested. In > > >>>>>>>>> the new approach we want to avoid this by making partition > > >>>> leaders > > >>>>>>> again > > >>>>>>>>> responsible for managing their partitions, even in diskless > > >>>> topics. > > >>>>>>>>> > > >>>>>>>>> In classic Kafka topics, batch data and metadata are blended > > >>>> together > > >>>>>>> in > > >>>>>>>>> the one partition log. The crux of the Diskless idea is to > > >>>> decouple > > >>>>>>> them > > >>>>>>>>> and move data to the remote storage, while keeping metadata > > >>>> somewhere > > >>>>>>> else. > > >>>>>>>>> Using the central batch coordinator for managing batch metadata > > >>>> is > > >>>>>> one > > >>>>>>> way, > > >>>>>>>>> but not the only. > > >>>>>>>>> > > >>>>>>>>> Let’s now think about managing metadata for each user partition > > >>>>>>>>> independently. Generally partitions are independent and don’t > > >>>> share > > >>>>>>>>> anything apart from that their data are mixed in WAL files. If > we > > >>>>>>> figure > > >>>>>>>>> out how to commit and later delete WAL files safely, we will > > >>>> achieve > > >>>>>>> the > > >>>>>>>>> necessary autonomy that allows us to get rid of the central > batch > > >>>>>>>>> coordinator. Instead, *each diskless user partition will be > > >>>> managed > > >>>>>> by > > >>>>>>> its > > >>>>>>>>> leader*, as in classic Kafka topics. Also like in classic > > >>>> topics, the > > >>>>>>>>> leader uses the partition log as the way to persist batch > > >>>> metadata, > > >>>>>>> i.e. > > >>>>>>>>> the regular batch header + the information about how to find > this > > >>>>>>> batch on > > >>>>>>>>> remote storage. In contrast to classic topics, batch data is in > > >>>>>> remote > > >>>>>>>>> storage. > > >>>>>>>>> > > >>>>>>>>> For clarity, let’s compare the three designs: > > >>>>>>>>> • Classic topics: > > >>>>>>>>> • Data and metadata are co-located in the partition log. > > >>>>>>>>> • The partition log content: [Batch header (metadata)|Batch > > >>>> data]. > > >>>>>>>>> • The partition log is replicated to the followers. > > >>>>>>>>> • The replicas and leader have local state built from > > >>>> metadata. > > >>>>>>>>> • Original Diskless: > > >>>>>>>>> • Metadata is in the batch coordinator, data is on remote > > >>>> storage. > > >>>>>>>>> • The partition state is global in the batch coordinator. > > >>>>>>>>> • New Diskless: > > >>>>>>>>> • Metadata is in the partition log, data is on remote storage. > > >>>>>>>>> • Partition log content: [Batch header (metadata)|Batch > > >>>>>> coordinates > > >>>>>>> on > > >>>>>>>>> remote storage]. > > >>>>>>>>> • The partition log is replicated to the followers. > > >>>>>>>>> • The replicas and leader have local state built from > > >>>> metadata. > > >>>>>>>>> > > >>>>>>>>> Let’s consider the produce path. 
Here’s the reminder of the > > >>>> original > > >>>>>>>>> Diskless design: > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> The new approach could be depicted as the following: > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> As you can see, the main difference is that now instead of a > > >>>> single > > >>>>>>> commit > > >>>>>>>>> request to the batch coordinator, we send multiple parallel > > >>>> commit > > >>>>>>> requests > > >>>>>>>>> to all the leaders of each partition involved in the WAL file. > > >>>> Each > > >>>>>> of > > >>>>>>> them > > >>>>>>>>> will commit its batches independently, without coordinating > with > > >>>>>> other > > >>>>>>>>> leaders and any other components. Batch data is addressed by > the > > >>>> WAL > > >>>>>>> file > > >>>>>>>>> name, the byte offset and size, which allows partitions to know > > >>>>>> nothing > > >>>>>>>>> about other partitions to access their data in shared WAL > files. > > >>>>>>>>> > > >>>>>>>>> The number of partitions involved in a single WAL file may be > > >>>> quite > > >>>>>>> large, > > >>>>>>>>> e.g. a hundred. A hundred network requests to commit one WAL > > >>>> file is > > >>>>>>> very > > >>>>>>>>> impractical. However, there are ways to reduce this number: > > >>>>>>>>> 1. Partition leaders are located on brokers. Requests to > > >>>> leaders on > > >>>>>>> one > > >>>>>>>>> broker could be grouped together into a single physical network > > >>>>>> request > > >>>>>>>>> (resembling the normal Produce request that may carry batches > for > > >>>>>> many > > >>>>>>>>> partitions inside). This will cap the number of network > requests > > >>>> to > > >>>>>> the > > >>>>>>>>> number of brokers in the cluster. > > >>>>>>>>> 2. If we craft the cluster metadata to make producers send > their > > >>>>>>> requests > > >>>>>>>>> to the right brokers (with respect to AZs), we may achieve the > > >>>> higher > > >>>>>>>>> concentration of logical commit requests in physical network > > >>>> requests > > >>>>>>>>> reducing the number of the latter ones even further, ideally to > > >>>> one. > > >>>>>>>>> > > >>>>>>>>> Obviously, out of multiple commit requests some may fail or > time > > >>>> out > > >>>>>>> for a > > >>>>>>>>> variety of reasons. This is fine. Some producers will receive > > >>>> totally > > >>>>>>> or > > >>>>>>>>> partially failed responses to their Produce requests, similar > to > > >>>> what > > >>>>>>> they > > >>>>>>>>> would have received when appending to a classic topic fails or > > >>>> times > > >>>>>>> out. > > >>>>>>>>> If a partition experiences problems, other partitions will not > be > > >>>>>>> affected > > >>>>>>>>> (again, like in classic topics). Of course, the uncommitted > data > > >>>> will > > >>>>>>> be > > >>>>>>>>> garbage in WAL files. But WAL files are short-lived (batches > are > > >>>>>>> constantly > > >>>>>>>>> assembled into segments and offloaded to tiered storage), so > this > > >>>>>>> garbage > > >>>>>>>>> will be eventually deleted. > > >>>>>>>>> > > >>>>>>>>> For safely deleting WAL files we now need to centrally manage > > >>>> them, > > >>>>>> as > > >>>>>>>>> this is the only state and logic that spans multiple > partitions. > > >>>> On > > >>>>>> the > > >>>>>>>>> diagram, you can see another commit request called “Commit file > > >>>> (best > > >>>>>>>>> effort)” going to the WAL File Manager. This manager will be > > >>>>>>> responsible > > >>>>>>>>> for the following: > > >>>>>>>>> 1. 
Collecting (by requests from brokers) and persisting > > >>>> information > > >>>>>>> about > > >>>>>>>>> committed WAL files. > > >>>>>>>>> 2. To handle potential failures in file information delivery, > it > > >>>>>> will > > >>>>>>> be > > >>>>>>>>> doing prefix scan on the remote storage periodically to find > and > > >>>>>>> register > > >>>>>>>>> unknown files. The period of this scan will be configurable and > > >>>>>> ideally > > >>>>>>>>> should be quite long. > > >>>>>>>>> 3. Checking with the relevant partition leaders (after a grace > > >>>>>>> period) if > > >>>>>>>>> they still have batches in a particular file. > > >>>>>>>>> 4. Physically deleting files when they aren’t anymore referred > > >>>> to by > > >>>>>>> any > > >>>>>>>>> partition. > > >>>>>>>>> > > >>>>>>>>> This new design offers the following advantages: > > >>>>>>>>> 1. It simplifies the implementation of many Kafka features such > > >>>> as > > >>>>>>>>> idempotence, transactions, queues, tiered storage, retention. > > >>>> Now we > > >>>>>>> don’t > > >>>>>>>>> need to abstract away and reuse the code from partition leaders > > >>>> in > > >>>>>> the > > >>>>>>>>> batch coordinator. Instead, we will literally use the same code > > >>>> paths > > >>>>>>> in > > >>>>>>>>> leaders, with little adaptation. Workflows from classic topics > > >>>> mostly > > >>>>>>>>> remain unchanged. > > >>>>>>>>> For example, it seems that > > >>>>>>>>> ReplicaManager.maybeSendPartitionsToTransactionCoordinator and > > >>>>>>>>> KafkaApis.handleWriteTxnMarkersRequest used for transaction > > >>>> support > > >>>>>> on > > >>>>>>> the > > >>>>>>>>> partition leader side could be used for diskless topics with > > >>>> little > > >>>>>>>>> adaptation. ProducerStateManager, needed for both idempotent > > >>>> produce > > >>>>>>> and > > >>>>>>>>> transactions, would be reused. > > >>>>>>>>> Another example is share groups support, where the share > > >>>> partition > > >>>>>>> leader, > > >>>>>>>>> being co-located with the partition leader, would execute the > > >>>> same > > >>>>>>> logic > > >>>>>>>>> for both diskless and classic topics. > > >>>>>>>>> 2. It returns to the familiar partition-based scaling model, > > >>>> where > > >>>>>>>>> partitions are independent. > > >>>>>>>>> 3. It makes the operation and failure patterns closer to the > > >>>>>> familiar > > >>>>>>>>> ones from classic topics. > > >>>>>>>>> 4. It opens a straightforward path to seamless switching the > > >>>> topics > > >>>>>>> modes > > >>>>>>>>> between diskless and classic. > > >>>>>>>>> > > >>>>>>>>> The rest of the things remain unchanged compared to the > previous > > >>>>>>> Diskless > > >>>>>>>>> design (after all previous discussions). Such things as local > > >>>> segment > > >>>>>>>>> materialization by replicas, the consume path, tiered storage > > >>>>>>> integration, > > >>>>>>>>> etc. > > >>>>>>>>> > > >>>>>>>>> If the community finds this design more suitable, we will > update > > >>>> the > > >>>>>>>>> KIP(s) accordingly and continue working on it. Please let us > know > > >>>>>> what > > >>>>>>> you > > >>>>>>>>> think. > > >>>>>>>>> > > >>>>>>>>> Best regards, > > >>>>>>>>> Ivan and Diskless team > > >>>>>>>>> > > >>>>>>>>> On Mon, Sep 29, 2025, at 15:06, Ivan Yurchenko wrote: > > >>>>>>>>>> Hi Justine, > > >>>>>>>>>> > > >>>>>>>>>> Yes, you're right. We need to track the aborted transactions > > >>>> for in > > >>>>>>> the > > >>>>>>>>> diskless coordinator for as long as the corresponding offsets > are > > >>>>>>> there. 
> > >>>>>>>>> With the tiered storage unification Greg mentioned earlier, > this > > >>>> will > > >>>>>>> be > > >>>>>>>>> finite time even for infinite data retention. > > >>>>>>>>>> > > >>>>>>>>>> Best, > > >>>>>>>>>> Ivan > > >>>>>>>>>> > > >>>>>>>>>> On Wed, Sep 17, 2025, at 19:41, Justine Olshan wrote: > > >>>>>>>>>>> Hey Ivan, > > >>>>>>>>>>> > > >>>>>>>>>>> Thanks for the response. I think most of what you said made > > >>>>>> sense, > > >>>>>>> but > > >>>>>>>>> I > > >>>>>>>>>>> did have some questions about this part: > > >>>>>>>>>>> > > >>>>>>>>>>>> As we understand this, the partition leader in classic > > >>>> topics > > >>>>>>> forgets > > >>>>>>>>>>> about a transaction once it’s replicated (HWM overpasses > > >>>> it). The > > >>>>>>>>>>> transaction coordinator acts like the main guardian, allowing > > >>>>>>> partition > > >>>>>>>>>>> leaders to do this safely. Please correct me if this is > > >>>> wrong. We > > >>>>>>> think > > >>>>>>>>>>> about relying on this with the batch coordinator and delete > > >>>> the > > >>>>>>>>> information > > >>>>>>>>>>> about a transaction once it’s finished (as there’s no > > >>>> replication > > >>>>>>> and > > >>>>>>>>> HWM > > >>>>>>>>>>> advances immediately). > > >>>>>>>>>>> > > >>>>>>>>>>> I didn't quite understand this. In classic topics, we have > > >>>> maps > > >>>>>> for > > >>>>>>>>> ongoing > > >>>>>>>>>>> transactions which remove state when the transaction is > > >>>> completed > > >>>>>>> and > > >>>>>>>>> an > > >>>>>>>>>>> aborted transactions index which is retained for much longer. > > >>>>>> Once > > >>>>>>> the > > >>>>>>>>>>> transaction is completed, the coordinator is no longer > > >>>> involved > > >>>>>> in > > >>>>>>>>>>> maintaining this partition side state, and it is subject to > > >>>>>>> compaction > > >>>>>>>>> etc. > > >>>>>>>>>>> Looking back at the outline provided above, I didn't see much > > >>>>>>> about the > > >>>>>>>>>>> fetch path, so maybe that could be expanded a bit further. I > > >>>> saw > > >>>>>>> the > > >>>>>>>>>>> following in a response: > > >>>>>>>>>>>> When the broker constructs a fully valid local segment, > > >>>> all the > > >>>>>>>>> necessary > > >>>>>>>>>>> control batches will be inserted and indices, including the > > >>>>>>> transaction > > >>>>>>>>>>> index will be built to serve FetchRequests exactly as they > > >>>> are > > >>>>>>> today. > > >>>>>>>>>>> > > >>>>>>>>>>> Based on this, it seems like we need to retain the > > >>>> information > > >>>>>>> about > > >>>>>>>>>>> aborted txns for longer. > > >>>>>>>>>>> > > >>>>>>>>>>> Thanks, > > >>>>>>>>>>> Justine > > >>>>>>>>>>> > > >>>>>>>>>>> On Mon, Sep 15, 2025 at 9:43 AM Ivan Yurchenko < > > >>>> [email protected]> > > >>>>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>>> Hi Justine and all, > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thank you for your questions! > > >>>>>>>>>>>> > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely identified > > >>>> with > > >>>>>>>>> producer ID > > >>>>>>>>>>>>> and epoch, the positive result of this check could be > > >>>> cached > > >>>>>>>>> locally > > >>>>>>>>>>>>> Are we saying that only new transaction version 2 > > >>>>>> transactions > > >>>>>>> can > > >>>>>>>>> be > > >>>>>>>>>>>> used > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions > > >>>> with > > >>>>>>>>> producer id + > > >>>>>>>>>>>>> epoch > > >>>>>>>>>>>> > > >>>>>>>>>>>> You’re right that we (probably unintentionally) focused > > >>>> only on > > >>>>>>>>> version 2. 
> > >>>>>>>>>>>> We can either limit the support to version 2 or consider > > >>>> using > > >>>>>>> some > > >>>>>>>>>>>> surrogates to support version 1. > > >>>>>>>>>>>> > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final transactional > > >>>>>>> checks > > >>>>>>>>> of the > > >>>>>>>>>>>>> batches. This procedure would output the same errors > > >>>> like the > > >>>>>>>>> partition > > >>>>>>>>>>>>> leader in classic topics would do. > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be > > >>>>>> checking > > >>>>>>> if > > >>>>>>>>> the > > >>>>>>>>>>>>> transaction was still ongoing for example?* * > > >>>>>>>>>>>> > > >>>>>>>>>>>> Yes, the producer epoch, that the transaction is ongoing, > > >>>> and > > >>>>>> of > > >>>>>>>>> course > > >>>>>>>>>>>> the normal idempotence checks. What the partition leader > > >>>> in the > > >>>>>>>>> classic > > >>>>>>>>>>>> topics does before appending a batch to the local log > > >>>> (e.g. in > > >>>>>>>>>>>> UnifiedLog.maybeStartTransactionVerification and > > >>>>>>>>>>>> UnifiedLog.analyzeAndValidateProducerState). In Diskless, > > >>>> we > > >>>>>>>>> unfortunately > > >>>>>>>>>>>> cannot do these checks before appending the data to the WAL > > >>>>>>> segment > > >>>>>>>>> and > > >>>>>>>>>>>> uploading it, but we can “tombstone” these batches in the > > >>>> batch > > >>>>>>>>> coordinator > > >>>>>>>>>>>> during the final commit. > > >>>>>>>>>>>> > > >>>>>>>>>>>>> Is there state about ongoing > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some other > > >>>> state > > >>>>>>>>> mentioned > > >>>>>>>>>>>> in > > >>>>>>>>>>>>> the End transaction section, but it's not super clear > > >>>> what > > >>>>>>> state is > > >>>>>>>>>>>> stored > > >>>>>>>>>>>>> and when it is stored. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Right, this should have been more explicit. As the > > >>>> partition > > >>>>>>> leader > > >>>>>>>>> tracks > > >>>>>>>>>>>> ongoing transactions for classic topics, the batch > > >>>> coordinator > > >>>>>>> has > > >>>>>>>>> to as > > >>>>>>>>>>>> well. So when a transaction starts and ends, the > > >>>> transaction > > >>>>>>>>> coordinator > > >>>>>>>>>>>> must inform the batch coordinator about this. > > >>>>>>>>>>>> > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO -- > > >>>> perhaps > > >>>>>>> that > > >>>>>>>>> would > > >>>>>>>>>>>> be > > >>>>>>>>>>>>> stored in the batch coordinator? > > >>>>>>>>>>>> > > >>>>>>>>>>>> Yes. This could be deduced from the committed batches and > > >>>> other > > >>>>>>>>>>>> information, but for the sake of performance we’d better > > >>>> store > > >>>>>> it > > >>>>>>>>>>>> explicitly. > > >>>>>>>>>>>> > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long transactional > > >>>>>>> state is > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will be > > >>>>>> cleaned > > >>>>>>> up? > > >>>>>>>>>>>> > > >>>>>>>>>>>> As we understand this, the partition leader in classic > > >>>> topics > > >>>>>>> forgets > > >>>>>>>>>>>> about a transaction once it’s replicated (HWM overpasses > > >>>> it). > > >>>>>> The > > >>>>>>>>>>>> transaction coordinator acts like the main guardian, > > >>>> allowing > > >>>>>>>>> partition > > >>>>>>>>>>>> leaders to do this safely. Please correct me if this is > > >>>> wrong. 
> > >>>>>> We > > >>>>>>>>> think > > >>>>>>>>>>>> about relying on this with the batch coordinator and > > >>>> delete the > > >>>>>>>>> information > > >>>>>>>>>>>> about a transaction once it’s finished (as there’s no > > >>>>>> replication > > >>>>>>>>> and HWM > > >>>>>>>>>>>> advances immediately). > > >>>>>>>>>>>> > > >>>>>>>>>>>> Best, > > >>>>>>>>>>>> Ivan > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Tue, Sep 9, 2025, at 00:38, Justine Olshan wrote: > > >>>>>>>>>>>>> Hey folks, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Excited to see some updates related to transactions! > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> I had a few questions. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely identified > > >>>> with > > >>>>>>>>> producer ID > > >>>>>>>>>>>>> and epoch, the positive result of this check could be > > >>>> cached > > >>>>>>>>> locally > > >>>>>>>>>>>>> Are we saying that only new transaction version 2 > > >>>>>> transactions > > >>>>>>> can > > >>>>>>>>> be > > >>>>>>>>>>>> used > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions > > >>>> with > > >>>>>>>>> producer id + > > >>>>>>>>>>>>> epoch > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final transactional > > >>>>>>> checks > > >>>>>>>>> of the > > >>>>>>>>>>>>> batches. This procedure would output the same errors > > >>>> like the > > >>>>>>>>> partition > > >>>>>>>>>>>>> leader in classic topics would do. > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be > > >>>>>> checking > > >>>>>>> if > > >>>>>>>>> the > > >>>>>>>>>>>>> transaction was still ongoing for example? Is there state > > >>>>>> about > > >>>>>>>>> ongoing > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some other > > >>>> state > > >>>>>>>>> mentioned > > >>>>>>>>>>>> in > > >>>>>>>>>>>>> the End transaction section, but it's not super clear > > >>>> what > > >>>>>>> state is > > >>>>>>>>>>>> stored > > >>>>>>>>>>>>> and when it is stored. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO -- > > >>>> perhaps > > >>>>>>> that > > >>>>>>>>> would > > >>>>>>>>>>>> be > > >>>>>>>>>>>>> stored in the batch coordinator? > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long transactional > > >>>>>>> state is > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will be > > >>>>>> cleaned > > >>>>>>> up? > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On Mon, Sep 8, 2025 at 10:38 AM Jun Rao > > >>>>>>> <[email protected]> > > >>>>>>>>>>>> wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> Hi, Greg and Ivan, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks for the update. A few comments. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> JR 10. "Consumer fetches are now served from local > > >>>>>> segments, > > >>>>>>>>> making > > >>>>>>>>>>>> use of > > >>>>>>>>>>>>>> the > > >>>>>>>>>>>>>> indexes, page cache, request purgatory, and zero-copy > > >>>>>>>>> functionality > > >>>>>>>>>>>> already > > >>>>>>>>>>>>>> built into classic topics." > > >>>>>>>>>>>>>> JR 10.1 Does the broker build the producer state for > > >>>> each > > >>>>>>>>> partition in > > >>>>>>>>>>>>>> diskless topics? > > >>>>>>>>>>>>>> JR 10.2 For transactional data, the consumer fetches > > >>>> need > > >>>>>> to > > >>>>>>> know > > >>>>>>>>>>>> aborted > > >>>>>>>>>>>>>> records. How is that achieved? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> JR 11. 
"The batch coordinator saves that the > > >>>> transaction is > > >>>>>>>>> finished > > >>>>>>>>>>>> and > > >>>>>>>>>>>>>> also inserts the control batches in the corresponding > > >>>> logs > > >>>>>>> of the > > >>>>>>>>>>>> involved > > >>>>>>>>>>>>>> Diskless topics. This happens only on the metadata > > >>>> level, > > >>>>>> no > > >>>>>>>>> actual > > >>>>>>>>>>>> control > > >>>>>>>>>>>>>> batches are written to any file. " > > >>>>>>>>>>>>>> A fetch response could include multiple transactional > > >>>>>>> batches. > > >>>>>>>>> How > > >>>>>>>>>>>> does the > > >>>>>>>>>>>>>> broker obtain the information about the ending control > > >>>>>> batch > > >>>>>>> for > > >>>>>>>>> each > > >>>>>>>>>>>>>> batch? Does that mean that a fetch response needs to be > > >>>>>>> built by > > >>>>>>>>>>>>>> stitching record batches and generated control batches > > >>>>>>> together? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> JR 12. Queues: Is there still a share partition leader > > >>>> that > > >>>>>>> all > > >>>>>>>>>>>> consumers > > >>>>>>>>>>>>>> are routed to? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> JR 13. "Should the KIPs be modified to include this or > > >>>> it's > > >>>>>>> too > > >>>>>>>>>>>>>> implementation-focused?" It would be useful to include > > >>>>>> enough > > >>>>>>>>> details > > >>>>>>>>>>>> to > > >>>>>>>>>>>>>> understand correctness and performance impact. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> HC5. Henry has a valid point. Requests from a given > > >>>>>> producer > > >>>>>>>>> contain a > > >>>>>>>>>>>>>> sequence number, which is ordered. If a producer sends > > >>>>>> every > > >>>>>>>>> Produce > > >>>>>>>>>>>>>> request to an arbitrary broker, those requests could > > >>>> reach > > >>>>>>> the > > >>>>>>>>> batch > > >>>>>>>>>>>>>> coordinator in different order and lead to rejection > > >>>> of the > > >>>>>>>>> produce > > >>>>>>>>>>>>>> requests. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Jun > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> On Thu, Sep 4, 2025 at 12:00 AM Ivan Yurchenko < > > >>>>>>> [email protected]> > > >>>>>>>>> wrote: > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Hi all, > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> We have also thought in a bit more details about > > >>>>>>> transactions > > >>>>>>>>> and > > >>>>>>>>>>>> queues, > > >>>>>>>>>>>>>>> here's the plan. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> *Transactions* > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> The support for transactions in *classic topics* is > > >>>> based > > >>>>>>> on > > >>>>>>>>> precise > > >>>>>>>>>>>>>>> interactions between three actors: clients (mostly > > >>>>>>> producers, > > >>>>>>>>> but > > >>>>>>>>>>>> also > > >>>>>>>>>>>>>>> consumers), brokers (ReplicaManager and other > > >>>> classes), > > >>>>>> and > > >>>>>>>>>>>> transaction > > >>>>>>>>>>>>>>> coordinators. Brokers also run partition leaders with > > >>>>>> their > > >>>>>>>>> local > > >>>>>>>>>>>> state > > >>>>>>>>>>>>>>> (ProducerStateManager and others). > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> The high level (some details skipped) workflow is the > > >>>>>>>>> following. > > >>>>>>>>>>>> When a > > >>>>>>>>>>>>>>> transactional Produce request is received by the > > >>>> broker: > > >>>>>>>>>>>>>>> 1. For each partition, the partition leader checks > > >>>> if a > > >>>>>>>>> non-empty > > >>>>>>>>>>>>>>> transaction is running for this partition. 
This is > > >>>> done > > >>>>>>> using > > >>>>>>>>> its > > >>>>>>>>>>>> local > > >>>>>>>>>>>>>>> state derived from the log metadata > > >>>>>> (ProducerStateManager, > > >>>>>>>>>>>>>>> VerificationStateEntry, VerificationGuard). > > >>>>>>>>>>>>>>> 2. The transaction coordinator is informed about all > > >>>> the > > >>>>>>>>> partitions > > >>>>>>>>>>>> that > > >>>>>>>>>>>>>>> aren’t part of the transaction to include them. > > >>>>>>>>>>>>>>> 3. The partition leaders do additional transactional > > >>>>>>> checks. > > >>>>>>>>>>>>>>> 4. The partition leaders append the transactional > > >>>> data to > > >>>>>>>>> their logs > > >>>>>>>>>>>> and > > >>>>>>>>>>>>>>> update some of their state (for example, log the fact > > >>>>>> that > > >>>>>>> the > > >>>>>>>>>>>>>> transaction > > >>>>>>>>>>>>>>> is running for the partition and its first offset). > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> When the transaction is committed or aborted: > > >>>>>>>>>>>>>>> 1. The producer contacts the transaction coordinator > > >>>>>>> directly > > >>>>>>>>> with > > >>>>>>>>>>>>>>> EndTxnRequest. > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes PREPARE_COMMIT > > >>>> or > > >>>>>>>>>>>> PREPARE_ABORT to > > >>>>>>>>>>>>>>> its log and responds to the producer. > > >>>>>>>>>>>>>>> 3. The transaction coordinator sends > > >>>>>>> WriteTxnMarkersRequest to > > >>>>>>>>> the > > >>>>>>>>>>>>>> leaders > > >>>>>>>>>>>>>>> of the involved partitions. > > >>>>>>>>>>>>>>> 4. The partition leaders write the transaction > > >>>> markers to > > >>>>>>>>> their logs > > >>>>>>>>>>>> and > > >>>>>>>>>>>>>>> respond to the coordinator. > > >>>>>>>>>>>>>>> 5. The coordinator writes the final transaction state > > >>>>>>>>>>>> COMPLETE_COMMIT or > > >>>>>>>>>>>>>>> COMPLETE_ABORT. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> In classic topics, partitions have leaders and lots > > >>>> of > > >>>>>>>>> important > > >>>>>>>>>>>> state > > >>>>>>>>>>>>>>> necessary for supporting this workflow is local. The > > >>>> main > > >>>>>>>>> challenge > > >>>>>>>>>>>> in > > >>>>>>>>>>>>>>> mapping this to Diskless comes from the fact there > > >>>> are no > > >>>>>>>>> partition > > >>>>>>>>>>>>>>> leaders, so the corresponding pieces of state need > > >>>> to be > > >>>>>>>>> globalized > > >>>>>>>>>>>> in > > >>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>> batch coordinator. We are already doing this to > > >>>> support > > >>>>>>>>> idempotent > > >>>>>>>>>>>>>> produce. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> The high level workflow for *diskless topics* would > > >>>> look > > >>>>>>> very > > >>>>>>>>>>>> similar: > > >>>>>>>>>>>>>>> 1. For each partition, the broker checks if a > > >>>> non-empty > > >>>>>>>>> transaction > > >>>>>>>>>>>> is > > >>>>>>>>>>>>>>> running for this partition. In contrast to classic > > >>>>>> topics, > > >>>>>>>>> this is > > >>>>>>>>>>>>>> checked > > >>>>>>>>>>>>>>> against the batch coordinator with a single RPC. > > >>>> Since a > > >>>>>>>>> transaction > > >>>>>>>>>>>>>> could > > >>>>>>>>>>>>>>> be uniquely identified with producer ID and epoch, > > >>>> the > > >>>>>>> positive > > >>>>>>>>>>>> result of > > >>>>>>>>>>>>>>> this check could be cached locally (for the double > > >>>>>>> configured > > >>>>>>>>>>>> duration > > >>>>>>>>>>>>>> of a > > >>>>>>>>>>>>>>> transaction, for example). > > >>>>>>>>>>>>>>> 2. 
The same: The transaction coordinator is informed > > >>>>>> about > > >>>>>>> all > > >>>>>>>>> the > > >>>>>>>>>>>>>>> partitions that aren’t part of the transaction to > > >>>> include > > >>>>>>> them. > > >>>>>>>>>>>>>>> 3. No transactional checks are done on the broker > > >>>> side. > > >>>>>>>>>>>>>>> 4. The broker appends the transactional data to the > > >>>>>> current > > >>>>>>>>> shared > > >>>>>>>>>>>> WAL > > >>>>>>>>>>>>>>> segment. It doesn’t update any transaction-related > > >>>> state > > >>>>>>> for > > >>>>>>>>> Diskless > > >>>>>>>>>>>>>>> topics, because it doesn’t have any. > > >>>>>>>>>>>>>>> 5. The WAL segment is committed to the batch > > >>>> coordinator > > >>>>>>> like > > >>>>>>>>> in the > > >>>>>>>>>>>>>>> normal produce flow. > > >>>>>>>>>>>>>>> 6. The batch coordinator does the final transactional > > >>>>>>> checks > > >>>>>>>>> of the > > >>>>>>>>>>>>>>> batches. This procedure would output the same errors > > >>>> like > > >>>>>>> the > > >>>>>>>>>>>> partition > > >>>>>>>>>>>>>>> leader in classic topics would do. I.e. some batches > > >>>>>> could > > >>>>>>> be > > >>>>>>>>>>>> rejected. > > >>>>>>>>>>>>>>> This means, there will potentially be garbage in the > > >>>> WAL > > >>>>>>>>> segment > > >>>>>>>>>>>> file in > > >>>>>>>>>>>>>>> case of transactional errors. This is preferable to > > >>>> doing > > >>>>>>> more > > >>>>>>>>>>>> network > > >>>>>>>>>>>>>>> round trips, especially considering the WAL segments > > >>>> will > > >>>>>>> be > > >>>>>>>>>>>> relatively > > >>>>>>>>>>>>>>> short-living (see the Greg's update above). > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> When the transaction is committed or aborted: > > >>>>>>>>>>>>>>> 1. The producer contacts the transaction coordinator > > >>>>>>> directly > > >>>>>>>>> with > > >>>>>>>>>>>>>>> EndTxnRequest. > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes PREPARE_COMMIT > > >>>> or > > >>>>>>>>>>>> PREPARE_ABORT to > > >>>>>>>>>>>>>>> its log and responds to the producer. > > >>>>>>>>>>>>>>> 3. *[NEW]* The transaction coordinator informs the > > >>>> batch > > >>>>>>>>> coordinator > > >>>>>>>>>>>> that > > >>>>>>>>>>>>>>> the transaction is finished. > > >>>>>>>>>>>>>>> 4. *[NEW]* The batch coordinator saves that the > > >>>>>>> transaction is > > >>>>>>>>>>>> finished > > >>>>>>>>>>>>>>> and also inserts the control batches in the > > >>>> corresponding > > >>>>>>> logs > > >>>>>>>>> of the > > >>>>>>>>>>>>>>> involved Diskless topics. This happens only on the > > >>>>>> metadata > > >>>>>>>>> level, no > > >>>>>>>>>>>>>>> actual control batches are written to any file. They > > >>>> will > > >>>>>>> be > > >>>>>>>>>>>> dynamically > > >>>>>>>>>>>>>>> created on Fetch and other read operations. We could > > >>>>>>>>> technically > > >>>>>>>>>>>> write > > >>>>>>>>>>>>>>> these control batches for real, but this would mean > > >>>> extra > > >>>>>>>>> produce > > >>>>>>>>>>>>>> latency, > > >>>>>>>>>>>>>>> so it's better just to mark them in the batch > > >>>> coordinator > > >>>>>>> and > > >>>>>>>>> save > > >>>>>>>>>>>> these > > >>>>>>>>>>>>>>> milliseconds. > > >>>>>>>>>>>>>>> 5. The transaction coordinator sends > > >>>>>>> WriteTxnMarkersRequest to > > >>>>>>>>> the > > >>>>>>>>>>>>>> leaders > > >>>>>>>>>>>>>>> of the involved partitions. – Now only to classic > > >>>> topics > > >>>>>>> now. > > >>>>>>>>>>>>>>> 6. 
The partition leaders of classic topics write the > > >>>>>>>>> transaction > > >>>>>>>>>>>> markers > > >>>>>>>>>>>>>>> to their logs and respond to the coordinator. > > >>>>>>>>>>>>>>> 7. The coordinator writes the final transaction state > > >>>>>>>>>>>> COMPLETE_COMMIT or > > >>>>>>>>>>>>>>> COMPLETE_ABORT. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Compared to the non-transactional produce flow, we > > >>>> get: > > >>>>>>>>>>>>>>> 1. An extra network round trip between brokers and > > >>>> the > > >>>>>>> batch > > >>>>>>>>>>>> coordinator > > >>>>>>>>>>>>>>> when a new partition appear in the transaction. To > > >>>>>>> mitigate the > > >>>>>>>>>>>> impact of > > >>>>>>>>>>>>>>> them: > > >>>>>>>>>>>>>>> - The results will be cached. > > >>>>>>>>>>>>>>> - The calls for multiple partitions in one Produce > > >>>>>>> request > > >>>>>>>>> will be > > >>>>>>>>>>>>>>> grouped. > > >>>>>>>>>>>>>>> - The batch coordinator should be optimized for > > >>>> fast > > >>>>>>>>> response to > > >>>>>>>>>>>> these > > >>>>>>>>>>>>>>> RPCs. > > >>>>>>>>>>>>>>> - The fact that a single producer normally will > > >>>>>>> communicate > > >>>>>>>>> with a > > >>>>>>>>>>>>>>> single broker for the duration of the transaction > > >>>> further > > >>>>>>>>> reduces the > > >>>>>>>>>>>>>>> expected number of round trips. > > >>>>>>>>>>>>>>> 2. An extra round trip between the transaction > > >>>>>> coordinator > > >>>>>>> and > > >>>>>>>>> batch > > >>>>>>>>>>>>>>> coordinator when a transaction is finished. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> With this proposal, transactions will also be able to > > >>>>>> span > > >>>>>>> both > > >>>>>>>>>>>> classic > > >>>>>>>>>>>>>>> and Diskless topics. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> *Queues* > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> The share group coordination and management is a > > >>>> side job > > >>>>>>> that > > >>>>>>>>>>>> doesn't > > >>>>>>>>>>>>>>> interfere with the topic itself (leadership, > > >>>> replicas, > > >>>>>>> physical > > >>>>>>>>>>>> storage > > >>>>>>>>>>>>>> of > > >>>>>>>>>>>>>>> records, etc.) and non-queue producers and consumers > > >>>>>>> (Fetch and > > >>>>>>>>>>>> Produce > > >>>>>>>>>>>>>>> RPCs, consumer group-related RPCs are not affected.) > > >>>> We > > >>>>>>> don't > > >>>>>>>>> see any > > >>>>>>>>>>>>>>> reason why we can't make Diskless topics compatible > > >>>> with > > >>>>>>> share > > >>>>>>>>>>>> groups the > > >>>>>>>>>>>>>>> same way as classic topics are. Even on the code > > >>>> level, > > >>>>>> we > > >>>>>>>>> don't > > >>>>>>>>>>>> expect > > >>>>>>>>>>>>>> any > > >>>>>>>>>>>>>>> serious refactoring: the same reading routines are > > >>>> used > > >>>>>>> that > > >>>>>>>>> are > > >>>>>>>>>>>> used for > > >>>>>>>>>>>>>>> fetching (e.g. ReplicaManager.readFromLog). > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Should the KIPs be modified to include this or it's > > >>>> too > > >>>>>>>>>>>>>>> implementation-focused? > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Best regards, > > >>>>>>>>>>>>>>> Ivan > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote: > > >>>>>>>>>>>>>>>> Hi all, > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Thank you all for your questions and design input > > >>>> on > > >>>>>>>>> KIP-1150. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> We have just updated KIP-1150 and KIP-1163 with a > > >>>> new > > >>>>>>>>> design. To > > >>>>>>>>>>>>>>> summarize > > >>>>>>>>>>>>>>>> the changes: > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> 1. 
The design prioritizes integrating with the > > >>>> existing > > >>>>>>>>> KIP-405 > > >>>>>>>>>>>> Tiered > > >>>>>>>>>>>>>>>> Storage interfaces, permitting data produced to a > > >>>>>>> Diskless > > >>>>>>>>> topic > > >>>>>>>>>>>> to be > > >>>>>>>>>>>>>>>> moved to tiered storage. > > >>>>>>>>>>>>>>>> This lowers the scalability requirements for the > > >>>> Batch > > >>>>>>>>> Coordinator > > >>>>>>>>>>>>>>>> component, and allows Diskless to compose with > > >>>> Tiered > > >>>>>>> Storage > > >>>>>>>>>>>> plugin > > >>>>>>>>>>>>>>>> features such as encryption and alternative data > > >>>>>> formats. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> 2. Consumer fetches are now served from local > > >>>> segments, > > >>>>>>>>> making use > > >>>>>>>>>>>> of > > >>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>> indexes, page cache, request purgatory, and > > >>>> zero-copy > > >>>>>>>>> functionality > > >>>>>>>>>>>>>>> already > > >>>>>>>>>>>>>>>> built into classic topics. > > >>>>>>>>>>>>>>>> However, local segments are now considered cache > > >>>>>>> elements, > > >>>>>>>>> do not > > >>>>>>>>>>>> need > > >>>>>>>>>>>>>> to > > >>>>>>>>>>>>>>>> be durably stored, and can be built without > > >>>> contacting > > >>>>>>> any > > >>>>>>>>> other > > >>>>>>>>>>>>>>> replicas. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> 3. The design has been simplified substantially, by > > >>>>>>> removing > > >>>>>>>>> the > > >>>>>>>>>>>>>> previous > > >>>>>>>>>>>>>>>> Diskless consume flow, distributed cache > > >>>> component, and > > >>>>>>>>> "object > > >>>>>>>>>>>>>>>> compaction/merging" step. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> The design maintains leaderless produces as > > >>>> enabled by > > >>>>>>> the > > >>>>>>>>> Batch > > >>>>>>>>>>>>>>>> Coordinator, and the same latency profiles as the > > >>>>>> earlier > > >>>>>>>>> design, > > >>>>>>>>>>>> while > > >>>>>>>>>>>>>>>> being simpler and integrating better into the > > >>>> existing > > >>>>>>>>> ecosystem. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Thanks, and we are eager to hear your feedback on > > >>>> the > > >>>>>> new > > >>>>>>>>> design. > > >>>>>>>>>>>>>>>> Greg Harris > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:30 PM Jun Rao > > >>>>>>>>> <[email protected]> > > >>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Hi, Jan, > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> For me, the main gap of KIP-1150 is the support > > >>>> of > > >>>>>> all > > >>>>>>>>> existing > > >>>>>>>>>>>>>> client > > >>>>>>>>>>>>>>>>> APIs. Currently, there is no design for > > >>>> supporting > > >>>>>> APIs > > >>>>>>>>> like > > >>>>>>>>>>>>>>> transactions > > >>>>>>>>>>>>>>>>> and queues. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Jun > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski > > >>>>>>>>>>>>>>>>> <[email protected]> wrote: > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Would it be a good time to ask for the current > > >>>>>>> status of > > >>>>>>>>> this > > >>>>>>>>>>>> KIP? > > >>>>>>>>>>>>>> I > > >>>>>>>>>>>>>>>>>> haven't seen much activity here for the past 2 > > >>>>>>> months, > > >>>>>>>>> the > > >>>>>>>>>>>> vote got > > >>>>>>>>>>>>>>>>> vetoed > > >>>>>>>>>>>>>>>>>> but I think the pending questions have been > > >>>>>> answered > > >>>>>>>>> since > > >>>>>>>>>>>> then. 
> > >>>>>>>>>>>>>>> KIP-1183 > > >>>>>>>>>>>>>>>>>> (AutoMQ's proposal) also didn't have any > > >>>> activity > > >>>>>>> since > > >>>>>>>>> May. > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> In my eyes KIP-1150 and KIP-1183 are two real > > >>>>>> choices > > >>>>>>>>> that can > > >>>>>>>>>>>> be > > >>>>>>>>>>>>>>>>>> made, with a coordinator-based approach being > > >>>> by > > >>>>>> far > > >>>>>>> the > > >>>>>>>>>>>> dominant > > >>>>>>>>>>>>>> one > > >>>>>>>>>>>>>>>>> when > > >>>>>>>>>>>>>>>>>> it comes to market adoption - but all these are > > >>>>>>>>> standalone > > >>>>>>>>>>>>>> products. > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> I'm a big fan of both approaches, but would > > >>>> hate to > > >>>>>>> see a > > >>>>>>>>>>>> stall. So > > >>>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>>>> question is: can we get an update? > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Maybe it's time to start another vote? Colin > > >>>>>> McCabe - > > >>>>>>>>> have your > > >>>>>>>>>>>>>>> questions > > >>>>>>>>>>>>>>>>>> been answered? If not, is there anything I can > > >>>> do > > >>>>>> to > > >>>>>>>>> help? I'm > > >>>>>>>>>>>>>> deeply > > >>>>>>>>>>>>>>>>>> familiar with both architectures and have > > >>>> written > > >>>>>>> about > > >>>>>>>>> both? > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Kind regards, > > >>>>>>>>>>>>>>>>>> Jan > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> On Tue, Jun 24, 2025 at 10:42 AM Stanislav > > >>>>>> Kozlovski > > >>>>>>> < > > >>>>>>>>>>>>>>>>>> [email protected]> wrote: > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> I have some nits - it may be useful to > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> a) group all the KIP email threads in the > > >>>> main > > >>>>>> one > > >>>>>>>>> (just a > > >>>>>>>>>>>> bunch > > >>>>>>>>>>>>>> of > > >>>>>>>>>>>>>>>>> links > > >>>>>>>>>>>>>>>>>>> to everything) > > >>>>>>>>>>>>>>>>>>> b) create the email threads > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> It's a bit hard to track it all - for > > >>>> example, I > > >>>>>>> was > > >>>>>>>>>>>> searching > > >>>>>>>>>>>>>> for > > >>>>>>>>>>>>>>> a > > >>>>>>>>>>>>>>>>>>> discuss thread for KIP-1165 for a while; As > > >>>> far > > >>>>>> as > > >>>>>>> I > > >>>>>>>>> can > > >>>>>>>>>>>> tell, it > > >>>>>>>>>>>>>>>>> doesn't > > >>>>>>>>>>>>>>>>>>> exist yet. > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Since the KIPs are published (by virtue of > > >>>> having > > >>>>>>> the > > >>>>>>>>> root > > >>>>>>>>>>>> KIP be > > >>>>>>>>>>>>>>>>>>> published, having a DISCUSS thread and links > > >>>> to > > >>>>>>>>> sub-KIPs > > >>>>>>>>>>>> where > > >>>>>>>>>>>>>> were > > >>>>>>>>>>>>>>>>> aimed > > >>>>>>>>>>>>>>>>>>> to move the discussion towards), I think it > > >>>> would > > >>>>>>> be > > >>>>>>>>> good to > > >>>>>>>>>>>>>> create > > >>>>>>>>>>>>>>>>>> DISCUSS > > >>>>>>>>>>>>>>>>>>> threads for them all. > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>>>>>>> Stan > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> On 2025/04/16 11:58:22 Josep Prat wrote: > > >>>>>>>>>>>>>>>>>>>> Hi Kafka Devs! > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> We want to start a new KIP discussion about > > >>>>>>>>> introducing a > > >>>>>>>>>>>> new > > >>>>>>>>>>>>>>> type of > > >>>>>>>>>>>>>>>>>>>> topics that would make use of Object > > >>>> Storage as > > >>>>>>> the > > >>>>>>>>> primary > > >>>>>>>>>>>>>>> source of > > >>>>>>>>>>>>>>>>>>>> storage. 
However, as this KIP is big we > > >>>> decided > > >>>>>>> to > > >>>>>>>>> split it > > >>>>>>>>>>>>>> into > > >>>>>>>>>>>>>>>>>> multiple > > >>>>>>>>>>>>>>>>>>>> related KIPs. > > >>>>>>>>>>>>>>>>>>>> We have the motivational KIP-1150 ( > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>> > > >>>>>>> > > >>>>>> > > >>>> > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics > > >>>>>>>>>>>>>>>>>>> ) > > >>>>>>>>>>>>>>>>>>>> that aims to discuss if Apache Kafka > > >>>> should aim > > >>>>>>> to > > >>>>>>>>> have > > >>>>>>>>>>>> this > > >>>>>>>>>>>>>>> type of > > >>>>>>>>>>>>>>>>>>>> feature at all. This KIP doesn't go onto > > >>>>>> details > > >>>>>>> on > > >>>>>>>>> how to > > >>>>>>>>>>>>>>> implement > > >>>>>>>>>>>>>>>>>> it. > > >>>>>>>>>>>>>>>>>>>> This follows the same approach used when we > > >>>>>>> discussed > > >>>>>>>>>>>> KRaft. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> But as we know that it is sometimes really > > >>>> hard > > >>>>>>> to > > >>>>>>>>> discuss > > >>>>>>>>>>>> on > > >>>>>>>>>>>>>>> that > > >>>>>>>>>>>>>>>>> meta > > >>>>>>>>>>>>>>>>>>>> level, we also created several sub-kips > > >>>> (linked > > >>>>>>> in > > >>>>>>>>>>>> KIP-1150) > > >>>>>>>>>>>>>> that > > >>>>>>>>>>>>>>>>> offer > > >>>>>>>>>>>>>>>>>>> an > > >>>>>>>>>>>>>>>>>>>> implementation of this feature. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> We kindly ask you to use the proper DISCUSS > > >>>>>>> threads > > >>>>>>>>> for > > >>>>>>>>>>>> each > > >>>>>>>>>>>>>>> type of > > >>>>>>>>>>>>>>>>>>>> concern and keep this one to discuss > > >>>> whether > > >>>>>>> Apache > > >>>>>>>>> Kafka > > >>>>>>>>>>>> wants > > >>>>>>>>>>>>>>> to > > >>>>>>>>>>>>>>>>> have > > >>>>>>>>>>>>>>>>>>>> this feature or not. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks in advance on behalf of all the > > >>>> authors > > >>>>>> of > > >>>>>>>>> this KIP. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ------------------ > > >>>>>>>>>>>>>>>>>>>> Josep Prat > > >>>>>>>>>>>>>>>>>>>> Open Source Engineering Director, Aiven > > >>>>>>>>>>>>>>>>>>>> [email protected] | +491715557497 | > > >>>>>>> aiven.io > > >>>>>>>>>>>>>>>>>>>> Aiven Deutschland GmbH > > >>>>>>>>>>>>>>>>>>>> Alexanderufer 3-7, 10117 Berlin > > >>>>>>>>>>>>>>>>>>>> Geschäftsführer: Oskari Saarenmaa, Hannu > > >>>>>>> Valtonen, > > >>>>>>>>>>>>>>>>>>>> Anna Richardson, Kenneth Chen > > >>>>>>>>>>>>>>>>>>>> Amtsgericht Charlottenburg, HRB 209739 B > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >> > > >> > > > > >
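
To make the revised commit path described at the top of this thread a bit more concrete: batches in a shared WAL file are addressed by (file name, byte offset, size), and the logical per-partition commits are grouped into one physical request per broker, so fan-out is capped by the broker count rather than the partition count. Below is a minimal, illustrative sketch of that grouping step; all names (BatchRef, CommitRequest, WalCommitPlanner) are hypothetical and not taken from KIP-1150/1163.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical types for illustration only -- not part of the KIPs.
record BatchRef(String topic, int partition, long walByteOffset, int sizeBytes) {}
record CommitRequest(int brokerId, String walFileName, List<BatchRef> batches) {}

public class WalCommitPlanner {

    /** Maps "topic-partition" to the broker currently hosting its leader. */
    private final Map<String, Integer> leaderBrokerByPartition;

    public WalCommitPlanner(Map<String, Integer> leaderBrokerByPartition) {
        this.leaderBrokerByPartition = leaderBrokerByPartition;
    }

    /**
     * Groups the logical per-partition commits of one shared WAL file into
     * one physical request per broker, capping fan-out at the broker count.
     */
    public List<CommitRequest> plan(String walFileName, List<BatchRef> batchesInFile) {
        Map<Integer, List<BatchRef>> byBroker = batchesInFile.stream()
            .collect(Collectors.groupingBy(
                b -> leaderBrokerByPartition.get(b.topic() + "-" + b.partition())));
        return byBroker.entrySet().stream()
            .map(e -> new CommitRequest(e.getKey(), walFileName, e.getValue()))
            .toList();
    }

    public static void main(String[] args) {
        Map<String, Integer> leaders = Map.of("orders-0", 1, "orders-1", 2, "payments-0", 1);
        WalCommitPlanner planner = new WalCommitPlanner(leaders);
        List<BatchRef> batches = List.of(
            new BatchRef("orders", 0, 0L, 4096),
            new BatchRef("orders", 1, 4096L, 2048),
            new BatchRef("payments", 0, 6144L, 1024));
        // Three partitions, but only two physical commit requests (brokers 1 and 2).
        planner.plan("wal-000123.log", batches).forEach(System.out::println);
    }
}
```

Running the main method with three partitions whose leaders sit on two brokers yields two physical commit requests, which is the property the design relies on; AZ-aware metadata would ideally bring that number down to one per produce.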

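Similarly, for the transactional flow discussed above (the transaction coordinator telling the batch coordinator when transactions start and finish, brokers caching a positive "transaction ongoing" check keyed by producer ID and epoch, and control batches existing only at the metadata level), here is a rough sketch of the bookkeeping involved. It assumes transaction version 2 so that (producer ID, epoch) uniquely identifies a transaction, per Justine's point; the class and method names are made up for illustration and are not from the KIPs.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of batch-coordinator transaction bookkeeping; illustrative only.
public class BatchCoordinatorTxnState {

    // Assumes transaction version 2, where (producerId, epoch) is unique per transaction.
    record TxnKey(long producerId, short producerEpoch) {}
    record OngoingTxn(Instant startedAt) {}

    private final Map<TxnKey, OngoingTxn> ongoing = new ConcurrentHashMap<>();
    private final Duration maxTxnDuration;

    public BatchCoordinatorTxnState(Duration maxTxnDuration) {
        this.maxTxnDuration = maxTxnDuration;
    }

    /** Called when the transaction coordinator reports that a transaction has started. */
    public void onTransactionStarted(long producerId, short epoch) {
        ongoing.put(new TxnKey(producerId, epoch), new OngoingTxn(Instant.now()));
    }

    /**
     * The check brokers issue for a partition's first transactional batch.
     * A positive answer can be cached broker-side, roughly for twice the
     * configured transaction duration, to avoid repeated round trips.
     */
    public boolean isTransactionOngoing(long producerId, short epoch) {
        return ongoing.containsKey(new TxnKey(producerId, epoch));
    }

    /**
     * Called when the transaction coordinator reports commit or abort. The
     * corresponding control batches are recorded only at the metadata level;
     * brokers synthesize them when rebuilding local segments for fetches.
     */
    public Optional<Duration> onTransactionFinished(long producerId, short epoch) {
        OngoingTxn txn = ongoing.remove(new TxnKey(producerId, epoch));
        return Optional.ofNullable(txn)
            .map(t -> Duration.between(t.startedAt(), Instant.now()));
    }

    /** Suggested TTL for the broker-side cache of positive ongoing-transaction checks. */
    public Duration cacheTtlHint() {
        return maxTxnDuration.multipliedBy(2);
    }
}
```

Aborted-transaction metadata would live alongside this state for as long as the corresponding offsets are addressable, as Ivan and Justine discussed, so consumer fetches can still be served the aborted-transaction information they need.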