Hi Greg, Thanks for sharing the meeting notes. I agree we should keep polishing the contents of 1150 & the high-level design in 1163 to prepare for a vote.
Thanks. Luke On Fri, Nov 14, 2025 at 3:54 AM Greg Harris <[email protected]> wrote: > Hi all, > > There was a video call between myself, Ivan Yurchenko, Jun Rao, and Andrew > Schofield pertaining to KIP-1150. Here are the notes from that meeting: > > Ivan: What is the future state of Kafka in this area, in 5 years? > Jun: Do we want something more cloud native? Yes, started with Tiered > Storage. If there’s a better way, we should explore it. In the long term > this will be useful > Because Kafka is used so widely, we need to make sure everything we add is > for the long term and for everyone, not just for a single company. > When we add TS, it doesn’t just solve Uber’s use-case. We want something > that’s high quality/lasts/maintainable, and can work with all existing > capabilities. > If both 1150 and 1176 proceed at the same time, it’s confusing. They > overlap, but Diskless is more ambitious. > If both KIPs are being seriously worked on, then we don’t really need both, > because Diskless clearly is better. Having multiple will confuse people. It > will duplicate some of the effort. > If we want diskless ultimately, what is the short term strategy, to get > some early wins first? > Ivan: Andrew, do you want a more revolutionary approach? > Andrew: Eventually the architecture will change substantially, it may not > be necessary to put all of that bill onto Diskless at once. > Greg: We all agree on having a high quality feature merged upstream, and > supporting all APIs > Jun: We should try and keep things simple, but there is some minimum > complexity needed. > When doing the short term changes (1176), it doesn’t really progress in > changing to a more modern architecture. > Greg: Was TS+Compaction the only feature miss we’ve had so far? > Jun: The danger of only applying changes to some part of the API, you set > the precedence that you only have to implement part of the API. Supporting > the full API set should be a minimum requirement. > Andrew: When we started Kraft, how much did we know the design? > Jun: For Kraft we didn’t really know much about the migration, but the > high-level was clear. > Greg: Is 1150 votable in its current state? > Jun: 1150 should promise to support all APIs. It doesn’t have to have all > the details/apis/etc. KIP-500 didn’t have it. > We do need some high-level design enough to give confidence that the > promise is able to be fulfilled. > Greg: Is the draft version in 1163 enough detail or is more needed? > Jun: We need to agree on the core design, such as leaderless etc. And how > the APIs will be supported. > Greg: Okay we can include these things, and provide a sketch of how the > other leader-based features operate. > Jun: Yeah if at a high level the sketch appears to work, we can approve > that functionality. > Are you committed to doing the more involved and big project? > Greg: Yes, we’re committed to the 1163 design and can’t really accept 1176. > Jun: TS was slow because of Uber resourcing problems > Greg: We’ll push internally for resources, and use the community sentiment > to motivate Aiven. > How far into the future should we look? What sort of scale? > Jun: As long as there’s a path forward, and we’re not closing off future > improvements, we can figure out how to handle a larger scale when it > arises. > Greg: Random replica placement is very harmful, can we recommend users to > use an external tool like CruiseControl? 
> Jun: Not everyone uses CruiseControl, we would probably need some solution > for this out of the box > Ivan: Should the Batch Coordinator be pluggable? > Jun: Out-of-box experience should be good, good to allow other > implementations > Greg: But it could hurt Kafka feature/upgrade velocity when we wait for > plugin providers to implement it > Ivan: We imagined that maybe cloud hyperscalers could implement it with > e.g. dynamodb > Greg: Could we bake more details of the different providers into Kafka, or > does it still make sense for it to be pluggable? > Jun: Make it whatever is easiest to roll out and add new clients > Andrew: What happens next? Do you want to get KIP-1150 voted? > Ivan: The vote is already open, we’re not too pressed for time. We’ll go > improve the 1163 design and communication. > Is 1176 a competing design? Someone will ask. > Jun: If we are seriously working on something more ambitious, yeah we > shouldn’t do the stop-gap solution. > It’s diverting review resources. If we can get the short term thing in 1yr > but Diskless solution is 2y it makes sense to go for Diskless. If it’s 5yr, > that’s different and maybe the stop-gap solution is needed. > Greg: I’m biased but I believe we’re in the 1yr/2yr case. Should we > explicitly exclude 1176? > Andrew: Put your arms around the feature set you actually want, and use > that to rule out 1176. > Probably don’t need -1 votes, most likely KIPs just don’t receive votes. > Ivan: Should we have sync meetings like tiered storage did? > Jun: Satish posted meeting notes regularly, we should do the same. > > To summarize, we will be polishing the contents of 1150 & high level design > in 1163 to prepare for a vote. > We believe that the community should select the feature set of 1150 to > fully eliminate producer cross-zone costs, and make the investment in a > high quality Diskless Topics implementation rather than in stop-gap > solutions. > > Thanks, > Greg > > On Fri, Nov 7, 2025 at 9:19 PM Max fortun <[email protected]> wrote: > > > This may be a tangent, but we needed to offload storage off of Kafka into > > S3. We are keeping Kafka not as a source of truth, but as a mostly > > ephemeral broker that can come and go as it pleases. Be that scaling or > > outage. Disks can be destroyed and recreated at will, we still retain > data > > and use broker for just that, brokering messages. Not only that, we > reduced > > the requirement on the actual Kafka resources by reducing the size of a > > payload via a claim check pattern. Maybe this is an anti–pattern, but it > is > > super fast and highly cost efficient. We reworked ProducerRequest to > allow > > plugins. We added a custom http plugin that submits every request via a > > persisted connection to a microservice. Microservice stores the payload > and > > returns a tiny json metadata object,a claim check, that can be used to > find > > the actual data. Think of it as zipping the payload. This claim check > > metadata traverses the pipelines with consumers using the urls in > metadata > > to pull what they need. Think unzipping. This allowed us to also pull > ONLY > > the data that we need in graphql like manner. So if you have a 100K json > > payload and you need only a subsection, you can pull that by jmespath. > When > > you have multiple consumer groups yanking down huge payloads it is > > cumbersome on the broker. 
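[Editor's illustration] Max's setup plugs into ProduceRequest handling on a reworked client, which is not shown here; as a rough client-side approximation of the same claim-check idea, a standard Kafka ProducerInterceptor could offload the payload and forward only a small pointer. Everything below (the ClaimCheckInterceptor name, the claim.check.endpoint config key, the HTTP store service, and the JSON shape) is hypothetical, a minimal sketch rather than Max's actual plugin:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

/**
 * Hypothetical claim-check interceptor: the large payload is offloaded to an
 * external store and only a tiny JSON "claim check" travels through Kafka.
 */
public class ClaimCheckInterceptor implements ProducerInterceptor<String, byte[]> {
    private final HttpClient http = HttpClient.newHttpClient();
    private URI storeEndpoint;

    @Override
    public void configure(Map<String, ?> configs) {
        // e.g. claim.check.endpoint=http://payload-store.internal/store (hypothetical key)
        storeEndpoint = URI.create(String.valueOf(configs.get("claim.check.endpoint")));
    }

    @Override
    public ProducerRecord<String, byte[]> onSend(ProducerRecord<String, byte[]> record) {
        try {
            // Store the payload out of band (synchronously here, for simplicity);
            // the store replies with a URL where the payload can be retrieved.
            HttpRequest store = HttpRequest.newBuilder(storeEndpoint)
                    .POST(HttpRequest.BodyPublishers.ofByteArray(record.value()))
                    .build();
            String url = http.send(store, HttpResponse.BodyHandlers.ofString()).body();

            // Replace the payload with a tiny claim-check document; consumers
            // dereference the URL, optionally selecting only a subsection.
            byte[] claimCheck = ("{\"claimCheck\":\"" + url + "\"}")
                    .getBytes(StandardCharsets.UTF_8);
            return new ProducerRecord<>(record.topic(), record.partition(),
                    record.timestamp(), record.key(), claimCheck, record.headers());
        } catch (Exception e) {
            throw new RuntimeException("claim-check offload failed", e);
        }
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) { }

    @Override
    public void close() { }
}

Such a class would be wired in through the producer's interceptor.classes setting; the consumer side would symmetrically dereference the claim check, and could apply a JMESPath expression so that only the needed subsection of a large payload is pulled.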
When you have the same consumer groups yanking > > down a claim check, and then going out of band directly to the source of > > truth, the broker has some breathing room. Obviously our microservice > does > > not go directly to the cloud storage, as that would be too slow. It > stores > > the payload in high speed memory cache and returns asap. That memory is > > eventually persisted into S3. The retrieval goest against the cache > first, > > then against the S3. Overall a rather cheappy and zippy solution. I tried > > proposing the KIP for this, but there was no excitement. Check this out: > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=318606528 > > > > > > > On Nov 7, 2025, at 5:49 PM, Jun Rao <[email protected]> wrote: > > > > > > Hi, Andrew, > > > > > > If we want to focus only on reducing cross-zone replication costs, > there > > is > > > an alternative design in the KIP-1176 discussion thread that seems > > simpler > > > than the proposal here. I am copying the outline of that approach > below. > > > > > > 1. A new leader is elected. > > > 2. Leader maintains a first tiered offset, which is initialized to log > > end > > > offset. > > > 3. Leader writes produced data from the client to local log. > > > 4. Leader uploads produced data from all local logs as a combined > object > > > 5. Leader stores the metadata for the combined object in memory. > > > 6. If a follower fetch request has an offset >= first tiered offset, > the > > > metadata for the corresponding combined object is returned. Otherwise, > > the > > > local data is returned. > > > 7. Leader periodically advances first tiered offset. > > > > > > It's still a bit unnatural, but it could work. > > > > > > Hi, Ivan, > > > > > > Are you still committed to proceeding with the original design of > > KIP-1150? > > > > > > Thanks, > > > > > > Jun > > > > > > On Sun, Nov 2, 2025 at 6:00 AM Andrew Schofield < > > [email protected]> > > > wrote: > > > > > >> Hi, > > >> I’ve been following KIP-1150 and friends for a while. I’m going to > jump > > >> into the discussions too. > > >> > > >> Looking back at Jack Vanlightly’s message, I am not quite so convinced > > >> that it’s a kind of fork in the road. The primary aim of the effort is > > to > > >> reduce cross-zone replication costs so Apache Kafka is not > prohibitively > > >> expensive to use on cloud storage. I think it would be entirely viable > > to > > >> prioritise code reuse for an initial implementation of diskless > topics, > > and > > >> we could still have a more cloud-native design in the future. It’s > hard > > to > > >> predict what the community will prioritise in the future. > > >> > > >> Of the three major revisions, I’m in the rev3 camp. We can support > > >> leaderless produce requests, first writing WAL segments into object > > >> storage, and then using the regular partition leaders to sequence the > > >> records. The active log segment for a diskless topic will initially > > contain > > >> batch coordinates rather than record batches. The batch coordinates > can > > be > > >> resolved from WAL segments for consumers, and also in order to prepare > > log > > >> segments for uploading to tiered storage. Jun is probably correct that > > we > > >> need a more frequent object merging process than tiered storage > > provides. 
> > >> This is just the transition from write-optimised WAL segments to > > >> read-optimised tiered segments, and all of the object storage-based > > >> implementations of Kafka that I’m aware of do this rearrangement. But > > >> perhaps this more frequent object merging is a pre-GA improvement, > > rather > > >> than a strict requirement for an initial implementation for early > access > > >> use. > > >> > > >> For zone-aligned share consumers, the share group assignor is intended > > to > > >> be rack-aware. Consumers should be assigned to partitions with leaders > > in > > >> their zone. The simple assignor is not rack-aware, but it easily could > > be > > >> or we could have a rack-aware assignor. > > >> > > >> Thanks, > > >> Andrew > > >> > > >> > > >>> On 24 Oct 2025, at 23:14, Jun Rao <[email protected]> wrote: > > >>> > > >>> Hi, Ivan, > > >>> > > >>> Thanks for the reply. > > >>> > > >>> "As I understand, you’re speaking about locally materialized > segments. > > >> They > > >>> will indeed consume some IOPS. See them as a cache that could always > be > > >>> restored from the remote storage. While it’s not ideal, it's still OK > > to > > >>> lose data in them due to a machine crash, for example. Because of > this, > > >> we > > >>> can avoid explicit flushing on local materialized segments at all and > > let > > >>> the file system and page cache figure out when to flush optimally. > This > > >>> would not eliminate the extra IOPS, but should reduce it > dramatically, > > >>> depending on throughput for each partition. We, of course, continue > > >>> flushing the metadata segments as before." > > >>> > > >>> If we have a mix of classic and diskless topics on the same broker, > > it's > > >>> important that the classic topics' data is flushed to disk as quickly > > as > > >>> possible. To achieve this, users typically set dirty_expire_centisecs > > in > > >>> the kernel based on the number of available disk IOPS. Once you set > > this > > >>> number, it applies to all dirty files, including the cached data in > > >>> diskless topics. So, if there are more files actively accumulating > > data, > > >>> the flush frequency and therefore RPO is reduced for classic topics. > > >>> > > >>> "We should have mentioned this explicitly, but this step, in fact, > > >> remains > > >>> in the form of segments offloading to tiered storage. When we > assemble > > a > > >>> segment and hand it over to RemoteLogManager, we’re effectively doing > > >>> metadata compaction: replacing a big number of pieces of metadata > about > > >>> individual batches with a single record in __remote_log_metadata." > > >>> > > >>> The object merging in tier storage typically only kicks in after a > few > > >>> hours. The impact is (1) the amount of accumulated metadata is still > > >> quite > > >>> large; (2) there are many small objects, leading to poor read > > >> performance. > > >>> I think we need a more frequent object merging process than tier > > storage > > >>> provides. > > >>> > > >>> Jun > > >>> > > >>> > > >>> On Thu, Oct 23, 2025 at 10:12 AM Ivan Yurchenko <[email protected]> > > wrote: > > >>> > > >>>> Hello Jack, Jun, Luke, and all! > > >>>> > > >>>> Thank you for your messages. > > >>>> > > >>>> Let me first address some of Jun’s comments. > > >>>> > > >>>>> First, it degrades the durability. > > >>>>> For each partition, now there are two files being actively written > > at a > > >>>>> given point of time, one for the data and another for the metadata. 
> > >>>>> Flushing each file requires a separate IO. If the disk has 1K IOPS > > and > > >> we > > >>>>> have 5K partitions in a broker, currently we can afford to flush > each > > >>>>> partition every 5 seconds, achieving an RPO of 5 seconds. If we > > double > > >>>> the > > >>>>> number of files per partition, we can only flush each partition > every > > >> 10 > > >>>>> seconds, which makes RPO twice as bad. > > >>>> > > >>>> As I understand, you’re speaking about locally materialized > segments. > > >> They > > >>>> will indeed consume some IOPS. See them as a cache that could always > > be > > >>>> restored from the remote storage. While it’s not ideal, it's still > OK > > to > > >>>> lose data in them due to a machine crash, for example. Because of > > this, > > >> we > > >>>> can avoid explicit flushing on local materialized segments at all > and > > >> let > > >>>> the file system and page cache figure out when to flush optimally. > > This > > >>>> would not eliminate the extra IOPS, but should reduce it > dramatically, > > >>>> depending on throughput for each partition. We, of course, continue > > >>>> flushing the metadata segments as before. > > >>>> > > >>>> It’s worth making a note on caching. I think nobody will disagree > that > > >>>> doing direct reads from remote storage every time a batch is > requested > > >> by a > > >>>> consumer will not be practical neither from the performance nor from > > the > > >>>> economy point of view. We need a way to keep the number of GET > > requests > > >>>> down. There are multiple options, for example: > > >>>> 1. Rack-aware distributed in-memory caching. > > >>>> 2. Local in-memory caching. Comes with less network chattiness and > > >> works > > >>>> well if we have more or less stable brokers to consume from. > > >>>> 3. Materialization of diskless logs on local disk. Way lower impact > on > > >>>> RAM and also requires stable brokers for consumption (using just > > >> assigned > > >>>> replicas will probably work well). > > >>>> > > >>>> Materialization is one of possible options, but we can choose > another > > >> one. > > >>>> However, we will have this dilemma regardless of whether we have an > > >>>> explicit coordinator or we go “coordinator-less”. > > >>>> > > >>>>> Second, if we ever need this > > >>>>> metadata somewhere else, say in the WAL file manager, the consumer > > >> needs > > >>>> to > > >>>>> subscribe to every partition in the cluster, which is inefficient. > > The > > >>>>> actual benefit of this approach is also questionable. On the > surface, > > >> it > > >>>>> might seem that we could reduce the number of lines that need to be > > >>>> changed > > >>>>> for this KIP. However, the changes are quite intrusive to the > classic > > >>>>> partition's code path and will probably make the code base harder > to > > >>>>> maintain in the long run. I like the original approach based on the > > >> batch > > >>>>> coordinator much better than this one. We could probably refactor > the > > >>>>> producer state code so that it could be reused in the batch > > >> coordinator. > > >>>> > > >>>> It’s hard to disagree with this. The explicit coordinator is more a > > side > > >>>> thing, while coordinator-less approach is more about extending > > >>>> ReplicaManager, UnifiedLog and others substantially. > > >>>> > > >>>>> Thanks for addressing the concerns on the number of RPCs in the > > produce > > >>>>> path. I agree that with the metadata crafting mechanism, we could > > >>>> mitigate > > >>>>> the PRC problem. 
However, since we now require the metadata to be > > >>>>> collocated with the data on the same set of brokers, it's weird > that > > >> they > > >>>>> are now managed by different mechanisms. The data assignment now > uses > > >> the > > >>>>> metadata crafting mechanism, but the metadata is stored in the > > classic > > >>>>> partition using its own assignment strategy. It will be complicated > > to > > >>>> keep > > >>>>> them collocated. > > >>>> > > >>>> I would like to note that the metadata crafting is needed only to > tell > > >>>> producers which brokers they should send Produce requests to, but > data > > >> (as > > >>>> in “locally materialized log”) is located on partition replicas, > i.e. > > >>>> automatically co-located with metadata. > > >>>> > > >>>> As a side note, it would probably be better that instead of > implicitly > > >>>> crafting partition metadata, we extend the metadata protocol so that > > for > > >>>> diskless partitions we return not only the leader and replicas, but > > also > > >>>> some “recommended produce brokers”, selected for optimal performance > > and > > >>>> costs. Producers will pick ones in their racks. > > >>>> > > >>>>> I am also concerned about the removal of the object > > compaction/merging > > >>>>> step. > > >>>> > > >>>> We should have mentioned this explicitly, but this step, in fact, > > >> remains > > >>>> in the form of segments offloading to tiered storage. When we > > assemble a > > >>>> segment and hand it over to RemoteLogManager, we’re effectively > doing > > >>>> metadata compaction: replacing a big number of pieces of metadata > > about > > >>>> individual batches with a single record in __remote_log_metadata. > > >>>> > > >>>> We could create a Diskless-specific merging mechanism instead if > > needed. > > >>>> It’s rather easy with the explicit coordinator approach. With the > > >>>> coordinator-less approach, this would probably be a bit more tricky > > >>>> (rewriting the tail of the log by the leader + replicating this > change > > >>>> reliably). > > >>>> > > >>>>> I see a tendency toward primarily optimizing for the fewest code > > >> changes > > >>>> in > > >>>>> the KIP. Instead, our primary goal should be a clean design that > can > > >> last > > >>>>> for the long term. > > >>>> > > >>>> Yes, totally agree. > > >>>> > > >>>> > > >>>> > > >>>> Luke, > > >>>>> I'm wondering if the complexity of designing txn and queue is > because > > >> of > > >>>>> leaderless cluster, do you think it will be simpler if we only > focus > > on > > >>>> the > > >>>>> "diskless" design to handle object compaction/merging to/from the > > >> remote > > >>>>> storage to save the cross-AZ cost? > > >>>> > > >>>> After some evolution of the original proposal, leaderless is now > > >> limited. > > >>>> We only need to be able to accept Produce requests on more than one > > >> broker > > >>>> to eliminate the cross-AZ costs for producers. Do I get it right > that > > >> you > > >>>> propose to get rid of this? Or do I misunderstand? > > >>>> > > >>>> > > >>>> > > >>>> Let’s now look at this problem from a higher level, as Jack > proposed. > > As > > >>>> it was said, the big choice we need to make is whether we 1) create > an > > >>>> explicit batch coordinator; or 2) go for the coordinator-less > > approach, > > >>>> where each diskless partition is managed by its leader as in classic > > >> topics. > > >>>> > > >>>> If we try to compare the two approaches: > > >>>> > > >>>> Pluggability: > > >>>> - Explicit coordinator: Possible. 
For example, some setups may > benefit > > >>>> from batch metadata being stored in a cloud database (such as AWS > > >> DynamoDB > > >>>> or GCP Spanner). > > >>>> - Coordinator-less: Impossible. > > >>>> > > >>>> Scalability and fault tolerance: > > >>>> - Explicit coordinator: Depends on the implementation and it’s also > > >>>> necessary to actively work for it. > > >>>> - Coordinator-less: Closer to classic Kafka topics. Scaling is done > by > > >>>> partition placement, partitions could fail independently. > > >>>> > > >>>> Separation of concerns: > > >>>> - Explicit coordinator: Very good. Diskless remains more independent > > >> from > > >>>> classic topics in terms of code and workflows. For example, the > > >>>> above-mentioned non-tiered storage metadata compaction mechanism > could > > >> be > > >>>> relatively simply implemented with it. As a flip side of this, some > > >>>> workflows (e.g. transactions) will have to be adapted. > > >>>> - Coordinator-less: Less so. It leans to the opposite: bringing > > diskless > > >>>> closer to classic topics. Some code paths and workflows could be > more > > >>>> straightforwardly reused, but they will inevitably have to be > adapted > > to > > >>>> accommodate both topic types as also discussed. > > >>>> > > >>>> Cloud-nativeness. This is a vague concept, also related to the > > previous, > > >>>> but let’s try: > > >>>> - Explicit coordinator: Storing and processing metadata separately > > makes > > >>>> it easier for brokers to take different roles, be purely stateless > if > > >>>> needed, etc. > > >>>> - Coordinator-less: Less so. Something could be achieved with > creative > > >>>> partition placement, but not much. > > >>>> > > >>>> Both seem to have their pros and cons. However, answering Jack’s > > >> question, > > >>>> the explicit coordinator approach may indeed lead to a more flexible > > >> design. > > >>>> > > >>>> > > >>>> The purpose of this deviation in the discussion was to receive a > > >>>> preliminary community evaluation of the coordinator-less approach > > >> without > > >>>> taking on the task of writing a separate KIP and fitting it in the > > >> system > > >>>> of KIP-1150 and its children. We’re open to stopping it and getting > > >> back to > > >>>> working out the coordinator design if the community doesn’t favor > the > > >>>> proposed approach. > > >>>> > > >>>> Best, > > >>>> Ivan and Diskless team > > >>>> > > >>>> On Mon, Oct 20, 2025, at 05:58, Luke Chen wrote: > > >>>>> Hi Ivan, > > >>>>> > > >>>>> As Jun pointed out, the updated design seems to have some > > shortcomings > > >>>>> although it simplifies the implementation. > > >>>>> > > >>>>> I'm wondering if the complexity of designing txn and queue is > because > > >> of > > >>>>> leaderless cluster, do you think it will be simpler if we only > focus > > on > > >>>> the > > >>>>> "diskless" design to handle object compaction/merging to/from the > > >> remote > > >>>>> storage to save the cross-AZ cost? > > >>>>> > > >>>>> > > >>>>> Thank you, > > >>>>> Luke > > >>>>> > > >>>>> On Sat, Oct 18, 2025 at 5:22 AM Jun Rao <[email protected]> > > >>>> wrote: > > >>>>> > > >>>>>> Hi, Ivan, > > >>>>>> > > >>>>>> Thanks for the explanation. > > >>>>>> > > >>>>>> "we write the reference to the WAL file with the batch data" > > >>>>>> > > >>>>>> I understand the approach now, but I think it is a hacky one. > There > > >> are > > >>>>>> multiple short comings with this design. First, it degrades the > > >>>> durability. 
> > >>>>>> For each partition, now there are two files being actively written > > at > > >> a > > >>>>>> given point of time, one for the data and another for the > metadata. > > >>>>>> Flushing each file requires a separate IO. If the disk has 1K IOPS > > and > > >>>> we > > >>>>>> have 5K partitions in a broker, currently we can afford to flush > > each > > >>>>>> partition every 5 seconds, achieving an RPO of 5 seconds. If we > > double > > >>>> the > > >>>>>> number of files per partition, we can only flush each partition > > every > > >>>> 10 > > >>>>>> seconds, which makes RPO twice as bad. Second, if we ever need > this > > >>>>>> metadata somewhere else, say in the WAL file manager, the consumer > > >>>> needs to > > >>>>>> subscribe to every partition in the cluster, which is inefficient. > > The > > >>>>>> actual benefit of this approach is also questionable. On the > > surface, > > >>>> it > > >>>>>> might seem that we could reduce the number of lines that need to > be > > >>>> changed > > >>>>>> for this KIP. However, the changes are quite intrusive to the > > classic > > >>>>>> partition's code path and will probably make the code base harder > to > > >>>>>> maintain in the long run. I like the original approach based on > the > > >>>> batch > > >>>>>> coordinator much better than this one. We could probably refactor > > the > > >>>>>> producer state code so that it could be reused in the batch > > >>>> coordinator. > > >>>>>> > > >>>>>> Thanks for addressing the concerns on the number of RPCs in the > > >> produce > > >>>>>> path. I agree that with the metadata crafting mechanism, we could > > >>>> mitigate > > >>>>>> the PRC problem. However, since we now require the metadata to be > > >>>>>> collocated with the data on the same set of brokers, it's weird > that > > >>>> they > > >>>>>> are now managed by different mechanisms. The data assignment now > > uses > > >>>> the > > >>>>>> metadata crafting mechanism, but the metadata is stored in the > > classic > > >>>>>> partition using its own assignment strategy. It will be > complicated > > to > > >>>> keep > > >>>>>> them collocated. > > >>>>>> > > >>>>>> I am also concerned about the removal of the object > > compaction/merging > > >>>>>> step. My first concern is on the amount of metadata that need to > be > > >>>> kept. > > >>>>>> Without object compcation, the metadata generated in the produce > > path > > >>>> can > > >>>>>> only be deleted after remote tiering kicks in. Let's say for every > > >>>> 250ms we > > >>>>>> produce 100 byte of metadata per partition. Let's say remoting > > tiering > > >>>>>> kicks in after 5 hours. In a cluster with 100K partitions, we need > > to > > >>>> keep > > >>>>>> about 100 * (1 / 0.2) * 5 * 3600 * 100K = 720 GB metadata, quite > > >>>>>> signficant. A second concern is on performance. Every time we need > > to > > >>>>>> rebuild the caching data, we need to read a bunch of small objects > > >>>> from S3, > > >>>>>> slowing down the building process. If a consumer happens to need > > such > > >>>> data, > > >>>>>> it could slow down the application. > > >>>>>> > > >>>>>> I see a tendency toward primarily optimizing for the fewest code > > >>>> changes in > > >>>>>> the KIP. Instead, our primary goal should be a clean design that > can > > >>>> last > > >>>>>> for the long term. 
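[Editor's illustration] For concreteness, the two estimates in Jun's message above work out as follows, assuming one ~100-byte metadata entry per partition every 250 ms (i.e. a 1/0.25 factor; the 720 GB figure corresponds to that rather than to the 1/0.2 written above):

  (5,000 partitions x 2 files) / 1,000 IOPS  = 10 s between flushes per partition, i.e. RPO roughly doubles from 5 s to 10 s
  100 bytes x (1 / 0.25 s) x (5 x 3,600 s) x 100,000 partitions = 7.2 x 10^11 bytes ≈ 720 GB of accumulated batch metadata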
> > >>>>>> > > >>>>>> Thanks, > > >>>>>> > > >>>>>> Jun > > >>>>>> > > >>>>>> On Tue, Oct 14, 2025 at 11:02 AM Ivan Yurchenko <[email protected]> > > >>>> wrote: > > >>>>>> > > >>>>>>> Hi Jun, > > >>>>>>> > > >>>>>>> Thank you for your message. I’m sorry that I failed to clearly > > >>>> explain > > >>>>>> the > > >>>>>>> idea. Let me try to fix this. > > >>>>>>> > > >>>>>>>> Does each partition now have a metadata partition and a separate > > >>>> data > > >>>>>>>> partition? If so, I am concerned that it essentially doubles the > > >>>> number > > >>>>>>> of > > >>>>>>>> partitions, which impacts the number of open file descriptors > and > > >>>> the > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a > > separate > > >>>>>>>> partition just to store the metadata. It's as if we are creating > > an > > >>>>>>>> internal topic with an unbounded number of partitions. > > >>>>>>> > > >>>>>>> No. There will be only one physical partition per diskless > > >>>> partition. Let > > >>>>>>> me explain this with an example. Let’s say we have a diskless > > >>>> partition > > >>>>>>> topic-0. It has three replicas 0, 1, 2; 0 is the leader. We > produce > > >>>> some > > >>>>>>> batches to this partition. The content of the segment file will > be > > >>>>>>> something like this (for each batch): > > >>>>>>> > > >>>>>>> BaseOffset: 00000000000000000000 (like in classic) > > >>>>>>> Length: 123456 (like in classic) > > >>>>>>> PartitionLeaderEpoch: like in classic > > >>>>>>> Magic: like in classic > > >>>>>>> CRC: like in classic > > >>>>>>> Attributes: like in classic > > >>>>>>> LastOffsetDelta: like in classic > > >>>>>>> BaseTimestamp: like in classic > > >>>>>>> MaxTimestamp: like in classic > > >>>>>>> ProducerId: like in classic > > >>>>>>> ProducerEpoch: like in classic > > >>>>>>> BaseSequence: like in classic > > >>>>>>> RecordsCount: like in classic > > >>>>>>> Records: > > >>>>>>> path/to/wal/files/5b55c4bb-f52a-4204-aea6-81226895158a; byte > offset > > >>>>>>> 123456 > > >>>>>>> > > >>>>>>> It looks very much like classic log entries. The only difference > is > > >>>> that > > >>>>>>> instead of writing real Records, we write the reference to the > WAL > > >>>> file > > >>>>>>> with the batch data (I guess we need only the name and the byte > > >>>> offset, > > >>>>>>> because the byte length is the standard field above). Otherwise, > > >>>> it’s a > > >>>>>>> normal Kafka log with the leader and replicas. > > >>>>>>> > > >>>>>>> So we have as many partitions for diskless as for classic. As of > > open > > >>>>>> file > > >>>>>>> descriptors, let’s proceed to the following: > > >>>>>>> > > >>>>>>>> Are the metadata and > > >>>>>>>> the data for the same partition always collocated on the same > > >>>> broker? > > >>>>>> If > > >>>>>>>> so, how do we enforce that when replicas are reassigned? > > >>>>>>> > > >>>>>>> The source of truth for the data is still in WAL files on object > > >>>> storage. > > >>>>>>> The source of truth for the metadata is in segment files on the > > >>>> brokers > > >>>>>> in > > >>>>>>> the replica set. Two new mechanisms are planned, both independent > > of > > >>>> this > > >>>>>>> new proposal, but I want to present them to give the idea that > > only a > > >>>>>>> limited amount of data files will be operated locally: > > >>>>>>> - We want to assemble batches into segment files and offload them > > to > > >>>>>>> tiered storage in order to prevent the unbounded growth of batch > > >>>>>> metadata. 
> > >>>>>>> For this, we need to open only a few file descriptors (for the > > >>>> segment > > >>>>>>> file itself + the necessary indexes) before the segment is fully > > >>>> written > > >>>>>>> and handed over to RemoteLogManager. > > >>>>>>> - We want to assemble local segment files for caching purposes as > > >>>> well, > > >>>>>>> i.e. to speed up fetching. This will not materialize the full > > >>>> content of > > >>>>>>> the log, but only the hot set according to some policy (or > > >>>> configurable > > >>>>>>> policies), i.e. the number of segments and file descriptors will > > >>>> also be > > >>>>>>> limited. > > >>>>>>> > > >>>>>>>> The number of RPCs in the produce path is significantly higher. > > For > > >>>>>>>> example, if a produce request has 100 partitions, in a cluster > > >>>> with 100 > > >>>>>>>> brokers, each produce request could generate 100 more RPC > > requests. > > >>>>>> This > > >>>>>>>> will significantly increase the request rate. > > >>>>>>> > > >>>>>>> This is a valid concern that we considered, but this issue can be > > >>>>>>> mitigated. I’ll try to explain the approach. > > >>>>>>> The situation with a single broker is trivial: all the commit > > >>>> requests go > > >>>>>>> from the broker to itself. > > >>>>>>> Let’s scale this to a multi-broker cluster, but located in the > > single > > >>>>>> rack > > >>>>>>> (AZ). Any broker can accept Produce requests for diskless > > >>>> partitions, but > > >>>>>>> we can tell producers (through metadata) to always send Produce > > >>>> requests > > >>>>>> to > > >>>>>>> leaders. For example, broker 0 hosts the leader replicas for > > diskless > > >>>>>>> partitions t1-0, t2-1, t3-0. It will receive diskless Produce > > >>>> requests > > >>>>>> for > > >>>>>>> these partitions in various combinations, but only for them. > > >>>>>>> > > >>>>>>> Broker 0 > > >>>>>>> +-----------------+ > > >>>>>>> | t1-0 | > > >>>>>>> | t2-1 <--------------------+ > > >>>>>>> | t3-0 | | > > >>>>>>> produce | +-------------+ | | > > >>>>>>> requests | | diskless | | | > > >>>>>>> --------------->| produce +--------------+ > > >>>>>>> for these | | WAL buffer | | commit requests > > >>>>>>> partitions | +-------------+ | for these partitions > > >>>>>>> | | > > >>>>>>> +-----------------+ > > >>>>>>> > > >>>>>>> The same applies for other brokers in this cluster. Effectively, > > each > > >>>>>>> broker will commit only to itself, which effectively means 1 > commit > > >>>>>> request > > >>>>>>> per WAL buffer (this may be 0 physical network calls, if we wish, > > >>>> just a > > >>>>>>> local function call). > > >>>>>>> > > >>>>>>> Now let’s scale this to multiple racks (AZs). Obviously, we > cannot > > >>>> always > > >>>>>>> send Produce requests to the designated leaders of diskless > > >>>> partitions: > > >>>>>>> this would mean inter-AZ network traffic, which we would like to > > >>>> avoid. > > >>>>>> To > > >>>>>>> avoid it, we say that every broker has a “diskless produce > > >>>>>> representative” > > >>>>>>> in every AZ. If we continue our example: when a Produce request > for > > >>>> t1-0, > > >>>>>>> t2-1, or t3-0 comes from a producer in AZ 0, it lands on broker 0 > > >>>> (in the > > >>>>>>> broker’s AZ the representative is the broker itself). However, if > > it > > >>>>>> comes > > >>>>>>> from AZ 1, it lands on broker 1; in AZ 2, it’s broker 2. 
> > >>>>>>> > > >>>>>>> |produce requests |produce requests |produce > > >>>> requests > > >>>>>>> |for t1-0, t2-1, t3-0 |for t1-0, t2-1, t3-0 |for t1-0, > t2-1, > > >>>>>> t3-0 > > >>>>>>> |from AZ 0 |from AZ 1 |from AZ 2 > > >>>>>>> v v v > > >>>>>>> Broker 0 (AZ 0) Broker 1 (AZ 1) Broker 2 (AZ 2) > > >>>>>>> +---------------+ +---------------+ +---------------+ > > >>>>>>> | t1-0 | | | | | > > >>>>>>> | t2-1 | | | | | > > >>>>>>> | t3-0 | | | | | > > >>>>>>> +---------------+ +--------+------+ +--------+------+ > > >>>>>>> ^ ^ | | > > >>>>>>> | +--------------------+ | > > >>>>>>> | commit requests for these partitions | > > >>>>>>> | | > > >>>>>>> +-------------------------------------------------+ > > >>>>>>> commit requests for these partitions > > >>>>>>> > > >>>>>>> All the partitions that broker 0 is the leader of will be > > >>>> “represented” > > >>>>>> by > > >>>>>>> brokers 1 and 2 in their AZs. > > >>>>>>> > > >>>>>>> Of course, this relationship goes both ways between AZs (not > > >>>> necessarily > > >>>>>>> between the same brokers). It means that provided the cluster is > > >>>> balanced > > >>>>>>> by the number of brokers per AZ, each broker will represent > > >>>>>> (number_of_azs > > >>>>>>> - 1) other brokers. This will result in the situation that for > the > > >>>>>> majority > > >>>>>>> of commits, each broker will do up to (number_of_azs - 1) network > > >>>> commit > > >>>>>>> requests (plus one local). Cloud regions tend to have 3 AZs, very > > >>>> rarely > > >>>>>>> more. That means, brokers will be doing up to 2 network commit > > >>>> requests > > >>>>>> per > > >>>>>>> WAL file. > > >>>>>>> > > >>>>>>> There are the following exceptions: > > >>>>>>> 1. Broker count imbalance between AZs. For example, when we have > 2 > > >>>> AZs > > >>>>>> and > > >>>>>>> one has three brokers and another AZ has one. This one broker > will > > do > > >>>>>>> between 1 and 3 commit requests per WAL file. This is not an > > extreme > > >>>>>>> amplification. Such an imbalance is not healthy in most practical > > >>>> setups > > >>>>>>> and should be avoided anyway. > > >>>>>>> 2. Leadership changes and metadata propagation period. When the > > >>>> partition > > >>>>>>> t3-0 is relocated from broker 0 to some broker 3, the producers > > will > > >>>> not > > >>>>>>> know this immediately (unless we want to be strict and respond > with > > >>>>>>> NOT_LEADER_OR_FOLLOWER). So if t1-0, t2-1, and t3-0 will come > > >>>> together > > >>>>>> in a > > >>>>>>> WAL buffer on broker 2, it will have to send two commit requests: > > to > > >>>>>> broker > > >>>>>>> 0 to commit t1-0 and t2-1, and to broker 3 to commit t3-0. This > > >>>> situation > > >>>>>>> is not permanent and as producers update the cluster metadata, it > > >>>> will be > > >>>>>>> resolved. > > >>>>>>> > > >>>>>>> This all could be built with the metadata crafting mechanism only > > >>>> (which > > >>>>>>> is anyway needed for Diskless in one way or another to direct > > >>>> producers > > >>>>>> and > > >>>>>>> consumers where we need to avoid inter-AZ traffic), just with the > > >>>> right > > >>>>>>> policy for it (for example, some deterministic hash-based > formula). > > >>>> I.e. > > >>>>>> no > > >>>>>>> explicit support for “produce representative” or anything like > this > > >>>> is > > >>>>>>> needed on the cluster level, in KRaft, etc. > > >>>>>>> > > >>>>>>>> The same WAL file metadata is now duplicated into two places, > > >>>> partition > > >>>>>>>> leader and WAL File Manager. 
Which one is the source of truth, > and > > >>>> how > > >>>>>> do > > >>>>>>>> we maintain consistency between the two places? > > >>>>>>> > > >>>>>>> We do only two operations on WAL files that span multiple > diskless > > >>>>>>> partitions: committing and deleting. Commits can be done > > >>>> independently as > > >>>>>>> described above. But deletes are different, because when a file > is > > >>>>>> deleted, > > >>>>>>> this affects all the partitions that still have alive batches in > > this > > >>>>>> file > > >>>>>>> (if any). > > >>>>>>> > > >>>>>>> The WAL file manager is a necessary point of coordination to > delete > > >>>> WAL > > >>>>>>> files safely. We can say it is the source of truth about files > > >>>>>> themselves, > > >>>>>>> while the partition leaders and their logs hold the truth about > > >>>> whether a > > >>>>>>> particular file contains live batches of this particular > partition. > > >>>>>>> > > >>>>>>> The file manager will do this important task: be able to say for > > sure > > >>>>>> that > > >>>>>>> a file does not contain any live batch of any existing partition. > > For > > >>>>>> this, > > >>>>>>> it will have to periodically check against the partition leaders. > > >>>>>>> Considering that batch deletion is irreversible, when we declare > a > > >>>> file > > >>>>>>> “empty”, this is guaranteed to be and stay so. > > >>>>>>> > > >>>>>>> The file manager has to know about files being committed to start > > >>>> track > > >>>>>>> them and periodically check if they are empty. We can consider > > >>>> various > > >>>>>> ways > > >>>>>>> to achieve this: > > >>>>>>> 1. As was proposed in my previous message: best effort commit by > > >>>> brokers > > >>>>>> + > > >>>>>>> periodic prefix scans of object storage to detect files that went > > >>>> below > > >>>>>> the > > >>>>>>> radar due to network issue or the file manager temporary > > >>>> unavailability. > > >>>>>>> We’re speaking about listing the file names only and opening only > > >>>>>>> previously unknown files in order to find the partitions involved > > >>>> with > > >>>>>> them. > > >>>>>>> 2. Only do scans without explicit commit, i.e. fill the list of > > files > > >>>>>>> fully asynchronously and in the background. This may be not ideal > > >>>> due to > > >>>>>>> costs and performance of scanning tons of files. However, the > > number > > >>>> of > > >>>>>>> live WAL files should be limited due to tiered storage > offloading + > > >>>> we > > >>>>>> can > > >>>>>>> optimize this if we give files some global soft order in their > > names. > > >>>>>>> > > >>>>>>>> I am not sure how this design simplifies the implementation. The > > >>>>>> existing > > >>>>>>>> producer/replication code can't be simply reused. Adjusting both > > >>>> the > > >>>>>>> write > > >>>>>>>> path in the leader and the replication path in the follower to > > >>>>>> understand > > >>>>>>>> batch-header only data is quite intrusive to the existing logic. > > >>>>>>> > > >>>>>>> It is true that we’ll have to change LocalLog and UnifiedLog in > > >>>> order to > > >>>>>>> support these changes. However, it seems that idempotence, > > >>>> transactions, > > >>>>>>> queues, tiered storage will have to be changed less than with the > > >>>>>> original > > >>>>>>> design. 
This is because the partition leader state would remain > in > > >>>> the > > >>>>>> same > > >>>>>>> place (on brokers) and existing workflows that involve it would > > have > > >>>> to > > >>>>>> be > > >>>>>>> changed less compared to the situation where we globalize the > > >>>> partition > > >>>>>>> leader state in the batch coordinator. I admit this is hard to > make > > >>>>>>> convincing without both real implementations to hand :) > > >>>>>>> > > >>>>>>>> I am also > > >>>>>>>> not sure how this enables seamless switching the topic modes > > >>>> between > > >>>>>>>> diskless and classic. Could you provide more details on those? > > >>>>>>> > > >>>>>>> Let’s consider the scenario of turning a classic topic into > > >>>> diskless. The > > >>>>>>> user sets diskless.enabled=true, the leader receives this > metadata > > >>>> update > > >>>>>>> and does the following: > > >>>>>>> 1. Stop accepting normal append writes. > > >>>>>>> 2. Close the current active segment. > > >>>>>>> 3. Start a new segment that will be written in the diskless > format > > >>>> (i.e. > > >>>>>>> without data). > > >>>>>>> 4. Start accepting diskless commits. > > >>>>>>> > > >>>>>>> Since it’s the same log, the followers will know about that > switch > > >>>>>>> consistently. They will finish replicating the classic segments > and > > >>>> start > > >>>>>>> replicating the diskless ones. They will always know where each > > >>>> batch is > > >>>>>>> located (either inside a classic segment or referenced by a > > diskless > > >>>>>> one). > > >>>>>>> Switching back should be similar. > > >>>>>>> > > >>>>>>> Doing this with the coordinator is possible, but has some > caveats. > > >>>> The > > >>>>>>> leader must do the following: > > >>>>>>> 1. Stop accepting normal append writes. > > >>>>>>> 2. Close the current active segment. > > >>>>>>> 3. Write a special control segment to persist and replicate the > > fact > > >>>> that > > >>>>>>> from offset N the partition is now in the diskless mode. > > >>>>>>> 4. Inform the coordinator about the first offset N of the > “diskless > > >>>> era”. > > >>>>>>> 5. Inform the controller quorum that the transition has finished > > and > > >>>> that > > >>>>>>> brokers now can process diskless writes for this partition. > > >>>>>>> This could fail at some points, so this will probably require > some > > >>>>>>> explicit state machine with replication either in the partition > log > > >>>> or in > > >>>>>>> KRaft. > > >>>>>>> > > >>>>>>> It seems that the coordinator-less approach makes this simpler > > >>>> because > > >>>>>> the > > >>>>>>> “coordinator” for the partition and the partition leader are the > > >>>> same and > > >>>>>>> they store the partition metadata in the same log, too. While in > > the > > >>>>>>> coordinator approach we have to perform some kind of a > distributed > > >>>> commit > > >>>>>>> to handover metadata management from the classic partition leader > > to > > >>>> the > > >>>>>>> batch coordinator. > > >>>>>>> > > >>>>>>> I hope these explanations help to clarify the idea. Please let me > > >>>> know if > > >>>>>>> I should go deeper anywhere. > > >>>>>>> > > >>>>>>> Best, > > >>>>>>> Ivan and the Diskless team > > >>>>>>> > > >>>>>>> On Tue, Oct 7, 2025, at 01:44, Jun Rao wrote: > > >>>>>>>> Hi, Ivan, > > >>>>>>>> > > >>>>>>>> Thanks for the update. > > >>>>>>>> > > >>>>>>>> I am not sure that I fully understand the new design, but it > seems > > >>>> less > > >>>>>>>> clean than before. 
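[Editor's illustration] Sketching the leader-side classic-to-diskless switch Ivan outlines a few paragraphs above. This is a minimal, hypothetical illustration of the coordinator-less variant; PartitionLog and its methods are invented names for the sketch, not existing Kafka APIs:

import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of the leader-side classic -> diskless switch described above. */
final class DisklessModeSwitch {

    /** Invented abstraction over the partition log; not an existing Kafka interface. */
    interface PartitionLog {
        void rejectClassicAppends();   // 1. stop accepting normal append writes
        void closeActiveSegment();     // 2. close the current active segment
        void startDisklessSegment();   // 3. new active segment holds batch coordinates, not data
    }

    private final AtomicBoolean diskless = new AtomicBoolean(false);

    void onDisklessEnabled(PartitionLog log) {
        // Triggered when the leader observes diskless.enabled=true in a metadata update.
        if (diskless.compareAndSet(false, true)) {
            log.rejectClassicAppends();
            log.closeActiveSegment();
            log.startDisklessSegment();
            // 4. From here on, diskless commits are accepted. Followers learn about the
            //    switch by replicating the same log, so the transition stays consistent.
        }
    }
}

Switching back (diskless -> classic) would mirror these steps, as noted above; the coordinator-based variant additionally needs the control segment, the handover of offset N to the coordinator, and the KRaft notification, which is why it likely requires an explicit replicated state machine.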
> > >>>>>>>> > > >>>>>>>> Does each partition now have a metadata partition and a separate > > >>>> data > > >>>>>>>> partition? If so, I am concerned that it essentially doubles the > > >>>> number > > >>>>>>> of > > >>>>>>>> partitions, which impacts the number of open file descriptors > and > > >>>> the > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a > > separate > > >>>>>>>> partition just to store the metadata. It's as if we are creating > > an > > >>>>>>>> internal topic with an unbounded number of partitions. Are the > > >>>> metadata > > >>>>>>> and > > >>>>>>>> the data for the same partition always collocated on the same > > >>>> broker? > > >>>>>> If > > >>>>>>>> so, how do we enforce that when replicas are reassigned? > > >>>>>>>> > > >>>>>>>> The number of RPCs in the produce path is significantly higher. > > For > > >>>>>>>> example, if a produce request has 100 partitions, in a cluster > > >>>> with 100 > > >>>>>>>> brokers, each produce request could generate 100 more RPC > > requests. > > >>>>>> This > > >>>>>>>> will significantly increase the request rate. > > >>>>>>>> > > >>>>>>>> The same WAL file metadata is now duplicated into two places, > > >>>> partition > > >>>>>>>> leader and WAL File Manager. Which one is the source of truth, > and > > >>>> how > > >>>>>> do > > >>>>>>>> we maintain consistency between the two places? > > >>>>>>>> > > >>>>>>>> I am not sure how this design simplifies the implementation. The > > >>>>>> existing > > >>>>>>>> producer/replication code can't be simply reused. Adjusting both > > >>>> the > > >>>>>>> write > > >>>>>>>> path in the leader and the replication path in the follower to > > >>>>>> understand > > >>>>>>>> batch-header only data is quite intrusive to the existing > logic. I > > >>>> am > > >>>>>>> also > > >>>>>>>> not sure how this enables seamless switching the topic modes > > >>>> between > > >>>>>>>> diskless and classic. Could you provide more details on those? > > >>>>>>>> > > >>>>>>>> Jun > > >>>>>>>> > > >>>>>>>> On Thu, Oct 2, 2025 at 5:08 AM Ivan Yurchenko <[email protected]> > > >>>> wrote: > > >>>>>>>> > > >>>>>>>>> Hi dear Kafka community, > > >>>>>>>>> > > >>>>>>>>> In the initial Diskless proposal, we proposed to have a > separate > > >>>>>>>>> component, batch/diskless coordinator, whose role would be to > > >>>>>> centrally > > >>>>>>>>> manage the batch and WAL file metadata for diskless topics. > This > > >>>>>>> component > > >>>>>>>>> drew many reasonable comments from the community about how it > > >>>> would > > >>>>>>> support > > >>>>>>>>> various Kafka features (transactions, queues) and its > > >>>> scalability. > > >>>>>>> While we > > >>>>>>>>> believe we have good answers to all the expressed concerns, we > > >>>> took a > > >>>>>>> step > > >>>>>>>>> back and looked at the problem from a different perspective. > > >>>>>>>>> > > >>>>>>>>> We would like to propose an alternative Diskless design > *without > > >>>> a > > >>>>>>>>> centralized coordinator*. We believe this approach has > potential > > >>>> and > > >>>>>>>>> propose to discuss it as it may be more appealing to the > > >>>> community. > > >>>>>>>>> > > >>>>>>>>> Let us explain the idea. Most of the complications with the > > >>>> original > > >>>>>>>>> Diskless approach come from one necessary architecture change: > > >>>>>>> globalizing > > >>>>>>>>> the local state of partition leader in the batch coordinator. 
> > >>>> This > > >>>>>>> causes > > >>>>>>>>> deviations to the established workflows in various features > like > > >>>>>>> produce > > >>>>>>>>> idempotence and transactions, queues, retention, etc. These > > >>>>>> deviations > > >>>>>>> need > > >>>>>>>>> to be carefully considered, designed, and later implemented and > > >>>>>>> tested. In > > >>>>>>>>> the new approach we want to avoid this by making partition > > >>>> leaders > > >>>>>>> again > > >>>>>>>>> responsible for managing their partitions, even in diskless > > >>>> topics. > > >>>>>>>>> > > >>>>>>>>> In classic Kafka topics, batch data and metadata are blended > > >>>> together > > >>>>>>> in > > >>>>>>>>> the one partition log. The crux of the Diskless idea is to > > >>>> decouple > > >>>>>>> them > > >>>>>>>>> and move data to the remote storage, while keeping metadata > > >>>> somewhere > > >>>>>>> else. > > >>>>>>>>> Using the central batch coordinator for managing batch metadata > > >>>> is > > >>>>>> one > > >>>>>>> way, > > >>>>>>>>> but not the only. > > >>>>>>>>> > > >>>>>>>>> Let’s now think about managing metadata for each user partition > > >>>>>>>>> independently. Generally partitions are independent and don’t > > >>>> share > > >>>>>>>>> anything apart from that their data are mixed in WAL files. If > we > > >>>>>>> figure > > >>>>>>>>> out how to commit and later delete WAL files safely, we will > > >>>> achieve > > >>>>>>> the > > >>>>>>>>> necessary autonomy that allows us to get rid of the central > batch > > >>>>>>>>> coordinator. Instead, *each diskless user partition will be > > >>>> managed > > >>>>>> by > > >>>>>>> its > > >>>>>>>>> leader*, as in classic Kafka topics. Also like in classic > > >>>> topics, the > > >>>>>>>>> leader uses the partition log as the way to persist batch > > >>>> metadata, > > >>>>>>> i.e. > > >>>>>>>>> the regular batch header + the information about how to find > this > > >>>>>>> batch on > > >>>>>>>>> remote storage. In contrast to classic topics, batch data is in > > >>>>>> remote > > >>>>>>>>> storage. > > >>>>>>>>> > > >>>>>>>>> For clarity, let’s compare the three designs: > > >>>>>>>>> • Classic topics: > > >>>>>>>>> • Data and metadata are co-located in the partition log. > > >>>>>>>>> • The partition log content: [Batch header (metadata)|Batch > > >>>> data]. > > >>>>>>>>> • The partition log is replicated to the followers. > > >>>>>>>>> • The replicas and leader have local state built from > > >>>> metadata. > > >>>>>>>>> • Original Diskless: > > >>>>>>>>> • Metadata is in the batch coordinator, data is on remote > > >>>> storage. > > >>>>>>>>> • The partition state is global in the batch coordinator. > > >>>>>>>>> • New Diskless: > > >>>>>>>>> • Metadata is in the partition log, data is on remote storage. > > >>>>>>>>> • Partition log content: [Batch header (metadata)|Batch > > >>>>>> coordinates > > >>>>>>> on > > >>>>>>>>> remote storage]. > > >>>>>>>>> • The partition log is replicated to the followers. > > >>>>>>>>> • The replicas and leader have local state built from > > >>>> metadata. > > >>>>>>>>> > > >>>>>>>>> Let’s consider the produce path. 
Here’s the reminder of the > > >>>> original > > >>>>>>>>> Diskless design: > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> The new approach could be depicted as the following: > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> As you can see, the main difference is that now instead of a > > >>>> single > > >>>>>>> commit > > >>>>>>>>> request to the batch coordinator, we send multiple parallel > > >>>> commit > > >>>>>>> requests > > >>>>>>>>> to all the leaders of each partition involved in the WAL file. > > >>>> Each > > >>>>>> of > > >>>>>>> them > > >>>>>>>>> will commit its batches independently, without coordinating > with > > >>>>>> other > > >>>>>>>>> leaders and any other components. Batch data is addressed by > the > > >>>> WAL > > >>>>>>> file > > >>>>>>>>> name, the byte offset and size, which allows partitions to know > > >>>>>> nothing > > >>>>>>>>> about other partitions to access their data in shared WAL > files. > > >>>>>>>>> > > >>>>>>>>> The number of partitions involved in a single WAL file may be > > >>>> quite > > >>>>>>> large, > > >>>>>>>>> e.g. a hundred. A hundred network requests to commit one WAL > > >>>> file is > > >>>>>>> very > > >>>>>>>>> impractical. However, there are ways to reduce this number: > > >>>>>>>>> 1. Partition leaders are located on brokers. Requests to > > >>>> leaders on > > >>>>>>> one > > >>>>>>>>> broker could be grouped together into a single physical network > > >>>>>> request > > >>>>>>>>> (resembling the normal Produce request that may carry batches > for > > >>>>>> many > > >>>>>>>>> partitions inside). This will cap the number of network > requests > > >>>> to > > >>>>>> the > > >>>>>>>>> number of brokers in the cluster. > > >>>>>>>>> 2. If we craft the cluster metadata to make producers send > their > > >>>>>>> requests > > >>>>>>>>> to the right brokers (with respect to AZs), we may achieve the > > >>>> higher > > >>>>>>>>> concentration of logical commit requests in physical network > > >>>> requests > > >>>>>>>>> reducing the number of the latter ones even further, ideally to > > >>>> one. > > >>>>>>>>> > > >>>>>>>>> Obviously, out of multiple commit requests some may fail or > time > > >>>> out > > >>>>>>> for a > > >>>>>>>>> variety of reasons. This is fine. Some producers will receive > > >>>> totally > > >>>>>>> or > > >>>>>>>>> partially failed responses to their Produce requests, similar > to > > >>>> what > > >>>>>>> they > > >>>>>>>>> would have received when appending to a classic topic fails or > > >>>> times > > >>>>>>> out. > > >>>>>>>>> If a partition experiences problems, other partitions will not > be > > >>>>>>> affected > > >>>>>>>>> (again, like in classic topics). Of course, the uncommitted > data > > >>>> will > > >>>>>>> be > > >>>>>>>>> garbage in WAL files. But WAL files are short-lived (batches > are > > >>>>>>> constantly > > >>>>>>>>> assembled into segments and offloaded to tiered storage), so > this > > >>>>>>> garbage > > >>>>>>>>> will be eventually deleted. > > >>>>>>>>> > > >>>>>>>>> For safely deleting WAL files we now need to centrally manage > > >>>> them, > > >>>>>> as > > >>>>>>>>> this is the only state and logic that spans multiple > partitions. > > >>>> On > > >>>>>> the > > >>>>>>>>> diagram, you can see another commit request called “Commit file > > >>>> (best > > >>>>>>>>> effort)” going to the WAL File Manager. This manager will be > > >>>>>>> responsible > > >>>>>>>>> for the following: > > >>>>>>>>> 1. 
Collecting (by requests from brokers) and persisting > > >>>> information > > >>>>>>> about > > >>>>>>>>> committed WAL files. > > >>>>>>>>> 2. To handle potential failures in file information delivery, > it > > >>>>>> will > > >>>>>>> be > > >>>>>>>>> doing prefix scan on the remote storage periodically to find > and > > >>>>>>> register > > >>>>>>>>> unknown files. The period of this scan will be configurable and > > >>>>>> ideally > > >>>>>>>>> should be quite long. > > >>>>>>>>> 3. Checking with the relevant partition leaders (after a grace > > >>>>>>> period) if > > >>>>>>>>> they still have batches in a particular file. > > >>>>>>>>> 4. Physically deleting files when they aren’t anymore referred > > >>>> to by > > >>>>>>> any > > >>>>>>>>> partition. > > >>>>>>>>> > > >>>>>>>>> This new design offers the following advantages: > > >>>>>>>>> 1. It simplifies the implementation of many Kafka features such > > >>>> as > > >>>>>>>>> idempotence, transactions, queues, tiered storage, retention. > > >>>> Now we > > >>>>>>> don’t > > >>>>>>>>> need to abstract away and reuse the code from partition leaders > > >>>> in > > >>>>>> the > > >>>>>>>>> batch coordinator. Instead, we will literally use the same code > > >>>> paths > > >>>>>>> in > > >>>>>>>>> leaders, with little adaptation. Workflows from classic topics > > >>>> mostly > > >>>>>>>>> remain unchanged. > > >>>>>>>>> For example, it seems that > > >>>>>>>>> ReplicaManager.maybeSendPartitionsToTransactionCoordinator and > > >>>>>>>>> KafkaApis.handleWriteTxnMarkersRequest used for transaction > > >>>> support > > >>>>>> on > > >>>>>>> the > > >>>>>>>>> partition leader side could be used for diskless topics with > > >>>> little > > >>>>>>>>> adaptation. ProducerStateManager, needed for both idempotent > > >>>> produce > > >>>>>>> and > > >>>>>>>>> transactions, would be reused. > > >>>>>>>>> Another example is share groups support, where the share > > >>>> partition > > >>>>>>> leader, > > >>>>>>>>> being co-located with the partition leader, would execute the > > >>>> same > > >>>>>>> logic > > >>>>>>>>> for both diskless and classic topics. > > >>>>>>>>> 2. It returns to the familiar partition-based scaling model, > > >>>> where > > >>>>>>>>> partitions are independent. > > >>>>>>>>> 3. It makes the operation and failure patterns closer to the > > >>>>>> familiar > > >>>>>>>>> ones from classic topics. > > >>>>>>>>> 4. It opens a straightforward path to seamless switching the > > >>>> topics > > >>>>>>> modes > > >>>>>>>>> between diskless and classic. > > >>>>>>>>> > > >>>>>>>>> The rest of the things remain unchanged compared to the > previous > > >>>>>>> Diskless > > >>>>>>>>> design (after all previous discussions). Such things as local > > >>>> segment > > >>>>>>>>> materialization by replicas, the consume path, tiered storage > > >>>>>>> integration, > > >>>>>>>>> etc. > > >>>>>>>>> > > >>>>>>>>> If the community finds this design more suitable, we will > update > > >>>> the > > >>>>>>>>> KIP(s) accordingly and continue working on it. Please let us > know > > >>>>>> what > > >>>>>>> you > > >>>>>>>>> think. > > >>>>>>>>> > > >>>>>>>>> Best regards, > > >>>>>>>>> Ivan and Diskless team > > >>>>>>>>> > > >>>>>>>>> On Mon, Sep 29, 2025, at 15:06, Ivan Yurchenko wrote: > > >>>>>>>>>> Hi Justine, > > >>>>>>>>>> > > >>>>>>>>>> Yes, you're right. We need to track the aborted transactions > > >>>> for in > > >>>>>>> the > > >>>>>>>>> diskless coordinator for as long as the corresponding offsets > are > > >>>>>>> there. 
> > >>>>>>>>> With the tiered storage unification Greg mentioned earlier, > this > > >>>> will > > >>>>>>> be > > >>>>>>>>> finite time even for infinite data retention. > > >>>>>>>>>> > > >>>>>>>>>> Best, > > >>>>>>>>>> Ivan > > >>>>>>>>>> > > >>>>>>>>>> On Wed, Sep 17, 2025, at 19:41, Justine Olshan wrote: > > >>>>>>>>>>> Hey Ivan, > > >>>>>>>>>>> > > >>>>>>>>>>> Thanks for the response. I think most of what you said made > > >>>>>> sense, > > >>>>>>> but > > >>>>>>>>> I > > >>>>>>>>>>> did have some questions about this part: > > >>>>>>>>>>> > > >>>>>>>>>>>> As we understand this, the partition leader in classic > > >>>> topics > > >>>>>>> forgets > > >>>>>>>>>>> about a transaction once it’s replicated (HWM overpasses > > >>>> it). The > > >>>>>>>>>>> transaction coordinator acts like the main guardian, allowing > > >>>>>>> partition > > >>>>>>>>>>> leaders to do this safely. Please correct me if this is > > >>>> wrong. We > > >>>>>>> think > > >>>>>>>>>>> about relying on this with the batch coordinator and delete > > >>>> the > > >>>>>>>>> information > > >>>>>>>>>>> about a transaction once it’s finished (as there’s no > > >>>> replication > > >>>>>>> and > > >>>>>>>>> HWM > > >>>>>>>>>>> advances immediately). > > >>>>>>>>>>> > > >>>>>>>>>>> I didn't quite understand this. In classic topics, we have > > >>>> maps > > >>>>>> for > > >>>>>>>>> ongoing > > >>>>>>>>>>> transactions which remove state when the transaction is > > >>>> completed > > >>>>>>> and > > >>>>>>>>> an > > >>>>>>>>>>> aborted transactions index which is retained for much longer. > > >>>>>> Once > > >>>>>>> the > > >>>>>>>>>>> transaction is completed, the coordinator is no longer > > >>>> involved > > >>>>>> in > > >>>>>>>>>>> maintaining this partition side state, and it is subject to > > >>>>>>> compaction > > >>>>>>>>> etc. > > >>>>>>>>>>> Looking back at the outline provided above, I didn't see much > > >>>>>>> about the > > >>>>>>>>>>> fetch path, so maybe that could be expanded a bit further. I > > >>>> saw > > >>>>>>> the > > >>>>>>>>>>> following in a response: > > >>>>>>>>>>>> When the broker constructs a fully valid local segment, > > >>>> all the > > >>>>>>>>> necessary > > >>>>>>>>>>> control batches will be inserted and indices, including the > > >>>>>>> transaction > > >>>>>>>>>>> index will be built to serve FetchRequests exactly as they > > >>>> are > > >>>>>>> today. > > >>>>>>>>>>> > > >>>>>>>>>>> Based on this, it seems like we need to retain the > > >>>> information > > >>>>>>> about > > >>>>>>>>>>> aborted txns for longer. > > >>>>>>>>>>> > > >>>>>>>>>>> Thanks, > > >>>>>>>>>>> Justine > > >>>>>>>>>>> > > >>>>>>>>>>> On Mon, Sep 15, 2025 at 9:43 AM Ivan Yurchenko < > > >>>> [email protected]> > > >>>>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>>> Hi Justine and all, > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thank you for your questions! > > >>>>>>>>>>>> > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely identified > > >>>> with > > >>>>>>>>> producer ID > > >>>>>>>>>>>>> and epoch, the positive result of this check could be > > >>>> cached > > >>>>>>>>> locally > > >>>>>>>>>>>>> Are we saying that only new transaction version 2 > > >>>>>> transactions > > >>>>>>> can > > >>>>>>>>> be > > >>>>>>>>>>>> used > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions > > >>>> with > > >>>>>>>>> producer id + > > >>>>>>>>>>>>> epoch > > >>>>>>>>>>>> > > >>>>>>>>>>>> You’re right that we (probably unintentionally) focused > > >>>> only on > > >>>>>>>>> version 2. 
> > >>>>>>>>>>>> We can either limit the support to version 2 or consider > > >>>> using > > >>>>>>> some > > >>>>>>>>>>>> surrogates to support version 1. > > >>>>>>>>>>>> > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final transactional > > >>>>>>> checks > > >>>>>>>>> of the > > >>>>>>>>>>>>> batches. This procedure would output the same errors > > >>>> like the > > >>>>>>>>> partition > > >>>>>>>>>>>>> leader in classic topics would do. > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be > > >>>>>> checking > > >>>>>>> if > > >>>>>>>>> the > > >>>>>>>>>>>>> transaction was still ongoing for example?* * > > >>>>>>>>>>>> > > >>>>>>>>>>>> Yes, the producer epoch, that the transaction is ongoing, > > >>>> and > > >>>>>> of > > >>>>>>>>> course > > >>>>>>>>>>>> the normal idempotence checks. What the partition leader > > >>>> in the > > >>>>>>>>> classic > > >>>>>>>>>>>> topics does before appending a batch to the local log > > >>>> (e.g. in > > >>>>>>>>>>>> UnifiedLog.maybeStartTransactionVerification and > > >>>>>>>>>>>> UnifiedLog.analyzeAndValidateProducerState). In Diskless, > > >>>> we > > >>>>>>>>> unfortunately > > >>>>>>>>>>>> cannot do these checks before appending the data to the WAL > > >>>>>>> segment > > >>>>>>>>> and > > >>>>>>>>>>>> uploading it, but we can “tombstone” these batches in the > > >>>> batch > > >>>>>>>>> coordinator > > >>>>>>>>>>>> during the final commit. > > >>>>>>>>>>>> > > >>>>>>>>>>>>> Is there state about ongoing > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some other > > >>>> state > > >>>>>>>>> mentioned > > >>>>>>>>>>>> in > > >>>>>>>>>>>>> the End transaction section, but it's not super clear > > >>>> what > > >>>>>>> state is > > >>>>>>>>>>>> stored > > >>>>>>>>>>>>> and when it is stored. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Right, this should have been more explicit. As the > > >>>> partition > > >>>>>>> leader > > >>>>>>>>> tracks > > >>>>>>>>>>>> ongoing transactions for classic topics, the batch > > >>>> coordinator > > >>>>>>> has > > >>>>>>>>> to as > > >>>>>>>>>>>> well. So when a transaction starts and ends, the > > >>>> transaction > > >>>>>>>>> coordinator > > >>>>>>>>>>>> must inform the batch coordinator about this. > > >>>>>>>>>>>> > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO -- > > >>>> perhaps > > >>>>>>> that > > >>>>>>>>> would > > >>>>>>>>>>>> be > > >>>>>>>>>>>>> stored in the batch coordinator? > > >>>>>>>>>>>> > > >>>>>>>>>>>> Yes. This could be deduced from the committed batches and > > >>>> other > > >>>>>>>>>>>> information, but for the sake of performance we’d better > > >>>> store > > >>>>>> it > > >>>>>>>>>>>> explicitly. > > >>>>>>>>>>>> > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long transactional > > >>>>>>> state is > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will be > > >>>>>> cleaned > > >>>>>>> up? > > >>>>>>>>>>>> > > >>>>>>>>>>>> As we understand this, the partition leader in classic > > >>>> topics > > >>>>>>> forgets > > >>>>>>>>>>>> about a transaction once it’s replicated (HWM overpasses > > >>>> it). > > >>>>>> The > > >>>>>>>>>>>> transaction coordinator acts like the main guardian, > > >>>> allowing > > >>>>>>>>> partition > > >>>>>>>>>>>> leaders to do this safely. Please correct me if this is > > >>>> wrong. 
> > >>>>>> We > > >>>>>>>>> think > > >>>>>>>>>>>> about relying on this with the batch coordinator and > > >>>> delete the > > >>>>>>>>> information > > >>>>>>>>>>>> about a transaction once it’s finished (as there’s no > > >>>>>> replication > > >>>>>>>>> and HWM > > >>>>>>>>>>>> advances immediately). > > >>>>>>>>>>>> > > >>>>>>>>>>>> Best, > > >>>>>>>>>>>> Ivan > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Tue, Sep 9, 2025, at 00:38, Justine Olshan wrote: > > >>>>>>>>>>>>> Hey folks, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Excited to see some updates related to transactions! > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> I had a few questions. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely identified > > >>>> with > > >>>>>>>>> producer ID > > >>>>>>>>>>>>> and epoch, the positive result of this check could be > > >>>> cached > > >>>>>>>>> locally > > >>>>>>>>>>>>> Are we saying that only new transaction version 2 > > >>>>>> transactions > > >>>>>>> can > > >>>>>>>>> be > > >>>>>>>>>>>> used > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions > > >>>> with > > >>>>>>>>> producer id + > > >>>>>>>>>>>>> epoch > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final transactional > > >>>>>>> checks > > >>>>>>>>> of the > > >>>>>>>>>>>>> batches. This procedure would output the same errors > > >>>> like the > > >>>>>>>>> partition > > >>>>>>>>>>>>> leader in classic topics would do. > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be > > >>>>>> checking > > >>>>>>> if > > >>>>>>>>> the > > >>>>>>>>>>>>> transaction was still ongoing for example? Is there state > > >>>>>> about > > >>>>>>>>> ongoing > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some other > > >>>> state > > >>>>>>>>> mentioned > > >>>>>>>>>>>> in > > >>>>>>>>>>>>> the End transaction section, but it's not super clear > > >>>> what > > >>>>>>> state is > > >>>>>>>>>>>> stored > > >>>>>>>>>>>>> and when it is stored. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO -- > > >>>> perhaps > > >>>>>>> that > > >>>>>>>>> would > > >>>>>>>>>>>> be > > >>>>>>>>>>>>> stored in the batch coordinator? > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long transactional > > >>>>>>> state is > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will be > > >>>>>> cleaned > > >>>>>>> up? > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On Mon, Sep 8, 2025 at 10:38 AM Jun Rao > > >>>>>>> <[email protected]> > > >>>>>>>>>>>> wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> Hi, Greg and Ivan, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks for the update. A few comments. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> JR 10. "Consumer fetches are now served from local > > >>>>>> segments, > > >>>>>>>>> making > > >>>>>>>>>>>> use of > > >>>>>>>>>>>>>> the > > >>>>>>>>>>>>>> indexes, page cache, request purgatory, and zero-copy > > >>>>>>>>> functionality > > >>>>>>>>>>>> already > > >>>>>>>>>>>>>> built into classic topics." > > >>>>>>>>>>>>>> JR 10.1 Does the broker build the producer state for > > >>>> each > > >>>>>>>>> partition in > > >>>>>>>>>>>>>> diskless topics? > > >>>>>>>>>>>>>> JR 10.2 For transactional data, the consumer fetches > > >>>> need > > >>>>>> to > > >>>>>>> know > > >>>>>>>>>>>> aborted > > >>>>>>>>>>>>>> records. How is that achieved? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> JR 11. 
"The batch coordinator saves that the > > >>>> transaction is > > >>>>>>>>> finished > > >>>>>>>>>>>> and > > >>>>>>>>>>>>>> also inserts the control batches in the corresponding > > >>>> logs > > >>>>>>> of the > > >>>>>>>>>>>> involved > > >>>>>>>>>>>>>> Diskless topics. This happens only on the metadata > > >>>> level, > > >>>>>> no > > >>>>>>>>> actual > > >>>>>>>>>>>> control > > >>>>>>>>>>>>>> batches are written to any file. " > > >>>>>>>>>>>>>> A fetch response could include multiple transactional > > >>>>>>> batches. > > >>>>>>>>> How > > >>>>>>>>>>>> does the > > >>>>>>>>>>>>>> broker obtain the information about the ending control > > >>>>>> batch > > >>>>>>> for > > >>>>>>>>> each > > >>>>>>>>>>>>>> batch? Does that mean that a fetch response needs to be > > >>>>>>> built by > > >>>>>>>>>>>>>> stitching record batches and generated control batches > > >>>>>>> together? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> JR 12. Queues: Is there still a share partition leader > > >>>> that > > >>>>>>> all > > >>>>>>>>>>>> consumers > > >>>>>>>>>>>>>> are routed to? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> JR 13. "Should the KIPs be modified to include this or > > >>>> it's > > >>>>>>> too > > >>>>>>>>>>>>>> implementation-focused?" It would be useful to include > > >>>>>> enough > > >>>>>>>>> details > > >>>>>>>>>>>> to > > >>>>>>>>>>>>>> understand correctness and performance impact. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> HC5. Henry has a valid point. Requests from a given > > >>>>>> producer > > >>>>>>>>> contain a > > >>>>>>>>>>>>>> sequence number, which is ordered. If a producer sends > > >>>>>> every > > >>>>>>>>> Produce > > >>>>>>>>>>>>>> request to an arbitrary broker, those requests could > > >>>> reach > > >>>>>>> the > > >>>>>>>>> batch > > >>>>>>>>>>>>>> coordinator in different order and lead to rejection > > >>>> of the > > >>>>>>>>> produce > > >>>>>>>>>>>>>> requests. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Jun > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> On Thu, Sep 4, 2025 at 12:00 AM Ivan Yurchenko < > > >>>>>>> [email protected]> > > >>>>>>>>> wrote: > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Hi all, > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> We have also thought in a bit more details about > > >>>>>>> transactions > > >>>>>>>>> and > > >>>>>>>>>>>> queues, > > >>>>>>>>>>>>>>> here's the plan. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> *Transactions* > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> The support for transactions in *classic topics* is > > >>>> based > > >>>>>>> on > > >>>>>>>>> precise > > >>>>>>>>>>>>>>> interactions between three actors: clients (mostly > > >>>>>>> producers, > > >>>>>>>>> but > > >>>>>>>>>>>> also > > >>>>>>>>>>>>>>> consumers), brokers (ReplicaManager and other > > >>>> classes), > > >>>>>> and > > >>>>>>>>>>>> transaction > > >>>>>>>>>>>>>>> coordinators. Brokers also run partition leaders with > > >>>>>> their > > >>>>>>>>> local > > >>>>>>>>>>>> state > > >>>>>>>>>>>>>>> (ProducerStateManager and others). > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> The high level (some details skipped) workflow is the > > >>>>>>>>> following. > > >>>>>>>>>>>> When a > > >>>>>>>>>>>>>>> transactional Produce request is received by the > > >>>> broker: > > >>>>>>>>>>>>>>> 1. For each partition, the partition leader checks > > >>>> if a > > >>>>>>>>> non-empty > > >>>>>>>>>>>>>>> transaction is running for this partition. 
This is > > >>>> done > > >>>>>>> using > > >>>>>>>>> its > > >>>>>>>>>>>> local > > >>>>>>>>>>>>>>> state derived from the log metadata > > >>>>>> (ProducerStateManager, > > >>>>>>>>>>>>>>> VerificationStateEntry, VerificationGuard). > > >>>>>>>>>>>>>>> 2. The transaction coordinator is informed about all > > >>>> the > > >>>>>>>>> partitions > > >>>>>>>>>>>> that > > >>>>>>>>>>>>>>> aren’t part of the transaction to include them. > > >>>>>>>>>>>>>>> 3. The partition leaders do additional transactional > > >>>>>>> checks. > > >>>>>>>>>>>>>>> 4. The partition leaders append the transactional > > >>>> data to > > >>>>>>>>> their logs > > >>>>>>>>>>>> and > > >>>>>>>>>>>>>>> update some of their state (for example, log the fact > > >>>>>> that > > >>>>>>> the > > >>>>>>>>>>>>>> transaction > > >>>>>>>>>>>>>>> is running for the partition and its first offset). > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> When the transaction is committed or aborted: > > >>>>>>>>>>>>>>> 1. The producer contacts the transaction coordinator > > >>>>>>> directly > > >>>>>>>>> with > > >>>>>>>>>>>>>>> EndTxnRequest. > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes PREPARE_COMMIT > > >>>> or > > >>>>>>>>>>>> PREPARE_ABORT to > > >>>>>>>>>>>>>>> its log and responds to the producer. > > >>>>>>>>>>>>>>> 3. The transaction coordinator sends > > >>>>>>> WriteTxnMarkersRequest to > > >>>>>>>>> the > > >>>>>>>>>>>>>> leaders > > >>>>>>>>>>>>>>> of the involved partitions. > > >>>>>>>>>>>>>>> 4. The partition leaders write the transaction > > >>>> markers to > > >>>>>>>>> their logs > > >>>>>>>>>>>> and > > >>>>>>>>>>>>>>> respond to the coordinator. > > >>>>>>>>>>>>>>> 5. The coordinator writes the final transaction state > > >>>>>>>>>>>> COMPLETE_COMMIT or > > >>>>>>>>>>>>>>> COMPLETE_ABORT. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> In classic topics, partitions have leaders and lots > > >>>> of > > >>>>>>>>> important > > >>>>>>>>>>>> state > > >>>>>>>>>>>>>>> necessary for supporting this workflow is local. The > > >>>> main > > >>>>>>>>> challenge > > >>>>>>>>>>>> in > > >>>>>>>>>>>>>>> mapping this to Diskless comes from the fact there > > >>>> are no > > >>>>>>>>> partition > > >>>>>>>>>>>>>>> leaders, so the corresponding pieces of state need > > >>>> to be > > >>>>>>>>> globalized > > >>>>>>>>>>>> in > > >>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>> batch coordinator. We are already doing this to > > >>>> support > > >>>>>>>>> idempotent > > >>>>>>>>>>>>>> produce. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> The high level workflow for *diskless topics* would > > >>>> look > > >>>>>>> very > > >>>>>>>>>>>> similar: > > >>>>>>>>>>>>>>> 1. For each partition, the broker checks if a > > >>>> non-empty > > >>>>>>>>> transaction > > >>>>>>>>>>>> is > > >>>>>>>>>>>>>>> running for this partition. In contrast to classic > > >>>>>> topics, > > >>>>>>>>> this is > > >>>>>>>>>>>>>> checked > > >>>>>>>>>>>>>>> against the batch coordinator with a single RPC. > > >>>> Since a > > >>>>>>>>> transaction > > >>>>>>>>>>>>>> could > > >>>>>>>>>>>>>>> be uniquely identified with producer ID and epoch, > > >>>> the > > >>>>>>> positive > > >>>>>>>>>>>> result of > > >>>>>>>>>>>>>>> this check could be cached locally (for the double > > >>>>>>> configured > > >>>>>>>>>>>> duration > > >>>>>>>>>>>>>> of a > > >>>>>>>>>>>>>>> transaction, for example). > > >>>>>>>>>>>>>>> 2. 
The same: The transaction coordinator is informed > > >>>>>> about > > >>>>>>> all > > >>>>>>>>> the > > >>>>>>>>>>>>>>> partitions that aren’t part of the transaction to > > >>>> include > > >>>>>>> them. > > >>>>>>>>>>>>>>> 3. No transactional checks are done on the broker > > >>>> side. > > >>>>>>>>>>>>>>> 4. The broker appends the transactional data to the > > >>>>>> current > > >>>>>>>>> shared > > >>>>>>>>>>>> WAL > > >>>>>>>>>>>>>>> segment. It doesn’t update any transaction-related > > >>>> state > > >>>>>>> for > > >>>>>>>>> Diskless > > >>>>>>>>>>>>>>> topics, because it doesn’t have any. > > >>>>>>>>>>>>>>> 5. The WAL segment is committed to the batch > > >>>> coordinator > > >>>>>>> like > > >>>>>>>>> in the > > >>>>>>>>>>>>>>> normal produce flow. > > >>>>>>>>>>>>>>> 6. The batch coordinator does the final transactional > > >>>>>>> checks > > >>>>>>>>> of the > > >>>>>>>>>>>>>>> batches. This procedure would output the same errors > > >>>> like > > >>>>>>> the > > >>>>>>>>>>>> partition > > >>>>>>>>>>>>>>> leader in classic topics would do. I.e. some batches > > >>>>>> could > > >>>>>>> be > > >>>>>>>>>>>> rejected. > > >>>>>>>>>>>>>>> This means, there will potentially be garbage in the > > >>>> WAL > > >>>>>>>>> segment > > >>>>>>>>>>>> file in > > >>>>>>>>>>>>>>> case of transactional errors. This is preferable to > > >>>> doing > > >>>>>>> more > > >>>>>>>>>>>> network > > >>>>>>>>>>>>>>> round trips, especially considering the WAL segments > > >>>> will > > >>>>>>> be > > >>>>>>>>>>>> relatively > > >>>>>>>>>>>>>>> short-living (see the Greg's update above). > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> When the transaction is committed or aborted: > > >>>>>>>>>>>>>>> 1. The producer contacts the transaction coordinator > > >>>>>>> directly > > >>>>>>>>> with > > >>>>>>>>>>>>>>> EndTxnRequest. > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes PREPARE_COMMIT > > >>>> or > > >>>>>>>>>>>> PREPARE_ABORT to > > >>>>>>>>>>>>>>> its log and responds to the producer. > > >>>>>>>>>>>>>>> 3. *[NEW]* The transaction coordinator informs the > > >>>> batch > > >>>>>>>>> coordinator > > >>>>>>>>>>>> that > > >>>>>>>>>>>>>>> the transaction is finished. > > >>>>>>>>>>>>>>> 4. *[NEW]* The batch coordinator saves that the > > >>>>>>> transaction is > > >>>>>>>>>>>> finished > > >>>>>>>>>>>>>>> and also inserts the control batches in the > > >>>> corresponding > > >>>>>>> logs > > >>>>>>>>> of the > > >>>>>>>>>>>>>>> involved Diskless topics. This happens only on the > > >>>>>> metadata > > >>>>>>>>> level, no > > >>>>>>>>>>>>>>> actual control batches are written to any file. They > > >>>> will > > >>>>>>> be > > >>>>>>>>>>>> dynamically > > >>>>>>>>>>>>>>> created on Fetch and other read operations. We could > > >>>>>>>>> technically > > >>>>>>>>>>>> write > > >>>>>>>>>>>>>>> these control batches for real, but this would mean > > >>>> extra > > >>>>>>>>> produce > > >>>>>>>>>>>>>> latency, > > >>>>>>>>>>>>>>> so it's better just to mark them in the batch > > >>>> coordinator > > >>>>>>> and > > >>>>>>>>> save > > >>>>>>>>>>>> these > > >>>>>>>>>>>>>>> milliseconds. > > >>>>>>>>>>>>>>> 5. The transaction coordinator sends > > >>>>>>> WriteTxnMarkersRequest to > > >>>>>>>>> the > > >>>>>>>>>>>>>> leaders > > >>>>>>>>>>>>>>> of the involved partitions. – Now only to classic > > >>>> topics > > >>>>>>> now. > > >>>>>>>>>>>>>>> 6. 
The partition leaders of classic topics write the > > >>>>>>>>> transaction > > >>>>>>>>>>>> markers > > >>>>>>>>>>>>>>> to their logs and respond to the coordinator. > > >>>>>>>>>>>>>>> 7. The coordinator writes the final transaction state > > >>>>>>>>>>>> COMPLETE_COMMIT or > > >>>>>>>>>>>>>>> COMPLETE_ABORT. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Compared to the non-transactional produce flow, we > > >>>> get: > > >>>>>>>>>>>>>>> 1. An extra network round trip between brokers and > > >>>> the > > >>>>>>> batch > > >>>>>>>>>>>> coordinator > > >>>>>>>>>>>>>>> when a new partition appear in the transaction. To > > >>>>>>> mitigate the > > >>>>>>>>>>>> impact of > > >>>>>>>>>>>>>>> them: > > >>>>>>>>>>>>>>> - The results will be cached. > > >>>>>>>>>>>>>>> - The calls for multiple partitions in one Produce > > >>>>>>> request > > >>>>>>>>> will be > > >>>>>>>>>>>>>>> grouped. > > >>>>>>>>>>>>>>> - The batch coordinator should be optimized for > > >>>> fast > > >>>>>>>>> response to > > >>>>>>>>>>>> these > > >>>>>>>>>>>>>>> RPCs. > > >>>>>>>>>>>>>>> - The fact that a single producer normally will > > >>>>>>> communicate > > >>>>>>>>> with a > > >>>>>>>>>>>>>>> single broker for the duration of the transaction > > >>>> further > > >>>>>>>>> reduces the > > >>>>>>>>>>>>>>> expected number of round trips. > > >>>>>>>>>>>>>>> 2. An extra round trip between the transaction > > >>>>>> coordinator > > >>>>>>> and > > >>>>>>>>> batch > > >>>>>>>>>>>>>>> coordinator when a transaction is finished. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> With this proposal, transactions will also be able to > > >>>>>> span > > >>>>>>> both > > >>>>>>>>>>>> classic > > >>>>>>>>>>>>>>> and Diskless topics. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> *Queues* > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> The share group coordination and management is a > > >>>> side job > > >>>>>>> that > > >>>>>>>>>>>> doesn't > > >>>>>>>>>>>>>>> interfere with the topic itself (leadership, > > >>>> replicas, > > >>>>>>> physical > > >>>>>>>>>>>> storage > > >>>>>>>>>>>>>> of > > >>>>>>>>>>>>>>> records, etc.) and non-queue producers and consumers > > >>>>>>> (Fetch and > > >>>>>>>>>>>> Produce > > >>>>>>>>>>>>>>> RPCs, consumer group-related RPCs are not affected.) > > >>>> We > > >>>>>>> don't > > >>>>>>>>> see any > > >>>>>>>>>>>>>>> reason why we can't make Diskless topics compatible > > >>>> with > > >>>>>>> share > > >>>>>>>>>>>> groups the > > >>>>>>>>>>>>>>> same way as classic topics are. Even on the code > > >>>> level, > > >>>>>> we > > >>>>>>>>> don't > > >>>>>>>>>>>> expect > > >>>>>>>>>>>>>> any > > >>>>>>>>>>>>>>> serious refactoring: the same reading routines are > > >>>> used > > >>>>>>> that > > >>>>>>>>> are > > >>>>>>>>>>>> used for > > >>>>>>>>>>>>>>> fetching (e.g. ReplicaManager.readFromLog). > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Should the KIPs be modified to include this or it's > > >>>> too > > >>>>>>>>>>>>>>> implementation-focused? > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Best regards, > > >>>>>>>>>>>>>>> Ivan > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote: > > >>>>>>>>>>>>>>>> Hi all, > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Thank you all for your questions and design input > > >>>> on > > >>>>>>>>> KIP-1150. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> We have just updated KIP-1150 and KIP-1163 with a > > >>>> new > > >>>>>>>>> design. To > > >>>>>>>>>>>>>>> summarize > > >>>>>>>>>>>>>>>> the changes: > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> 1. 
The design prioritizes integrating with the > > >>>> existing > > >>>>>>>>> KIP-405 > > >>>>>>>>>>>> Tiered > > >>>>>>>>>>>>>>>> Storage interfaces, permitting data produced to a > > >>>>>>> Diskless > > >>>>>>>>> topic > > >>>>>>>>>>>> to be > > >>>>>>>>>>>>>>>> moved to tiered storage. > > >>>>>>>>>>>>>>>> This lowers the scalability requirements for the > > >>>> Batch > > >>>>>>>>> Coordinator > > >>>>>>>>>>>>>>>> component, and allows Diskless to compose with > > >>>> Tiered > > >>>>>>> Storage > > >>>>>>>>>>>> plugin > > >>>>>>>>>>>>>>>> features such as encryption and alternative data > > >>>>>> formats. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> 2. Consumer fetches are now served from local > > >>>> segments, > > >>>>>>>>> making use > > >>>>>>>>>>>> of > > >>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>> indexes, page cache, request purgatory, and > > >>>> zero-copy > > >>>>>>>>> functionality > > >>>>>>>>>>>>>>> already > > >>>>>>>>>>>>>>>> built into classic topics. > > >>>>>>>>>>>>>>>> However, local segments are now considered cache > > >>>>>>> elements, > > >>>>>>>>> do not > > >>>>>>>>>>>> need > > >>>>>>>>>>>>>> to > > >>>>>>>>>>>>>>>> be durably stored, and can be built without > > >>>> contacting > > >>>>>>> any > > >>>>>>>>> other > > >>>>>>>>>>>>>>> replicas. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> 3. The design has been simplified substantially, by > > >>>>>>> removing > > >>>>>>>>> the > > >>>>>>>>>>>>>> previous > > >>>>>>>>>>>>>>>> Diskless consume flow, distributed cache > > >>>> component, and > > >>>>>>>>> "object > > >>>>>>>>>>>>>>>> compaction/merging" step. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> The design maintains leaderless produces as > > >>>> enabled by > > >>>>>>> the > > >>>>>>>>> Batch > > >>>>>>>>>>>>>>>> Coordinator, and the same latency profiles as the > > >>>>>> earlier > > >>>>>>>>> design, > > >>>>>>>>>>>> while > > >>>>>>>>>>>>>>>> being simpler and integrating better into the > > >>>> existing > > >>>>>>>>> ecosystem. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Thanks, and we are eager to hear your feedback on > > >>>> the > > >>>>>> new > > >>>>>>>>> design. > > >>>>>>>>>>>>>>>> Greg Harris > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:30 PM Jun Rao > > >>>>>>>>> <[email protected]> > > >>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Hi, Jan, > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> For me, the main gap of KIP-1150 is the support > > >>>> of > > >>>>>> all > > >>>>>>>>> existing > > >>>>>>>>>>>>>> client > > >>>>>>>>>>>>>>>>> APIs. Currently, there is no design for > > >>>> supporting > > >>>>>> APIs > > >>>>>>>>> like > > >>>>>>>>>>>>>>> transactions > > >>>>>>>>>>>>>>>>> and queues. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Jun > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski > > >>>>>>>>>>>>>>>>> <[email protected]> wrote: > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Would it be a good time to ask for the current > > >>>>>>> status of > > >>>>>>>>> this > > >>>>>>>>>>>> KIP? > > >>>>>>>>>>>>>> I > > >>>>>>>>>>>>>>>>>> haven't seen much activity here for the past 2 > > >>>>>>> months, > > >>>>>>>>> the > > >>>>>>>>>>>> vote got > > >>>>>>>>>>>>>>>>> vetoed > > >>>>>>>>>>>>>>>>>> but I think the pending questions have been > > >>>>>> answered > > >>>>>>>>> since > > >>>>>>>>>>>> then. 
> > >>>>>>>>>>>>>>> KIP-1183 > > >>>>>>>>>>>>>>>>>> (AutoMQ's proposal) also didn't have any > > >>>> activity > > >>>>>>> since > > >>>>>>>>> May. > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> In my eyes KIP-1150 and KIP-1183 are two real > > >>>>>> choices > > >>>>>>>>> that can > > >>>>>>>>>>>> be > > >>>>>>>>>>>>>>>>>> made, with a coordinator-based approach being > > >>>> by > > >>>>>> far > > >>>>>>> the > > >>>>>>>>>>>> dominant > > >>>>>>>>>>>>>> one > > >>>>>>>>>>>>>>>>> when > > >>>>>>>>>>>>>>>>>> it comes to market adoption - but all these are > > >>>>>>>>> standalone > > >>>>>>>>>>>>>> products. > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> I'm a big fan of both approaches, but would > > >>>> hate to > > >>>>>>> see a > > >>>>>>>>>>>> stall. So > > >>>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>>>> question is: can we get an update? > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Maybe it's time to start another vote? Colin > > >>>>>> McCabe - > > >>>>>>>>> have your > > >>>>>>>>>>>>>>> questions > > >>>>>>>>>>>>>>>>>> been answered? If not, is there anything I can > > >>>> do > > >>>>>> to > > >>>>>>>>> help? I'm > > >>>>>>>>>>>>>> deeply > > >>>>>>>>>>>>>>>>>> familiar with both architectures and have > > >>>> written > > >>>>>>> about > > >>>>>>>>> both? > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Kind regards, > > >>>>>>>>>>>>>>>>>> Jan > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> On Tue, Jun 24, 2025 at 10:42 AM Stanislav > > >>>>>> Kozlovski > > >>>>>>> < > > >>>>>>>>>>>>>>>>>> [email protected]> wrote: > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> I have some nits - it may be useful to > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> a) group all the KIP email threads in the > > >>>> main > > >>>>>> one > > >>>>>>>>> (just a > > >>>>>>>>>>>> bunch > > >>>>>>>>>>>>>> of > > >>>>>>>>>>>>>>>>> links > > >>>>>>>>>>>>>>>>>>> to everything) > > >>>>>>>>>>>>>>>>>>> b) create the email threads > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> It's a bit hard to track it all - for > > >>>> example, I > > >>>>>>> was > > >>>>>>>>>>>> searching > > >>>>>>>>>>>>>> for > > >>>>>>>>>>>>>>> a > > >>>>>>>>>>>>>>>>>>> discuss thread for KIP-1165 for a while; As > > >>>> far > > >>>>>> as > > >>>>>>> I > > >>>>>>>>> can > > >>>>>>>>>>>> tell, it > > >>>>>>>>>>>>>>>>> doesn't > > >>>>>>>>>>>>>>>>>>> exist yet. > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Since the KIPs are published (by virtue of > > >>>> having > > >>>>>>> the > > >>>>>>>>> root > > >>>>>>>>>>>> KIP be > > >>>>>>>>>>>>>>>>>>> published, having a DISCUSS thread and links > > >>>> to > > >>>>>>>>> sub-KIPs > > >>>>>>>>>>>> where > > >>>>>>>>>>>>>> were > > >>>>>>>>>>>>>>>>> aimed > > >>>>>>>>>>>>>>>>>>> to move the discussion towards), I think it > > >>>> would > > >>>>>>> be > > >>>>>>>>> good to > > >>>>>>>>>>>>>> create > > >>>>>>>>>>>>>>>>>> DISCUSS > > >>>>>>>>>>>>>>>>>>> threads for them all. > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>>>>>>> Stan > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> On 2025/04/16 11:58:22 Josep Prat wrote: > > >>>>>>>>>>>>>>>>>>>> Hi Kafka Devs! > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> We want to start a new KIP discussion about > > >>>>>>>>> introducing a > > >>>>>>>>>>>> new > > >>>>>>>>>>>>>>> type of > > >>>>>>>>>>>>>>>>>>>> topics that would make use of Object > > >>>> Storage as > > >>>>>>> the > > >>>>>>>>> primary > > >>>>>>>>>>>>>>> source of > > >>>>>>>>>>>>>>>>>>>> storage. 
However, as this KIP is big we > > >>>> decided > > >>>>>>> to > > >>>>>>>>> split it > > >>>>>>>>>>>>>> into > > >>>>>>>>>>>>>>>>>> multiple > > >>>>>>>>>>>>>>>>>>>> related KIPs. > > >>>>>>>>>>>>>>>>>>>> We have the motivational KIP-1150 ( > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>> > > >>>>>>> > > >>>>>> > > >>>> > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics > > >>>>>>>>>>>>>>>>>>> ) > > >>>>>>>>>>>>>>>>>>>> that aims to discuss if Apache Kafka > > >>>> should aim > > >>>>>>> to > > >>>>>>>>> have > > >>>>>>>>>>>> this > > >>>>>>>>>>>>>>> type of > > >>>>>>>>>>>>>>>>>>>> feature at all. This KIP doesn't go onto > > >>>>>> details > > >>>>>>> on > > >>>>>>>>> how to > > >>>>>>>>>>>>>>> implement > > >>>>>>>>>>>>>>>>>> it. > > >>>>>>>>>>>>>>>>>>>> This follows the same approach used when we > > >>>>>>> discussed > > >>>>>>>>>>>> KRaft. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> But as we know that it is sometimes really > > >>>> hard > > >>>>>>> to > > >>>>>>>>> discuss > > >>>>>>>>>>>> on > > >>>>>>>>>>>>>>> that > > >>>>>>>>>>>>>>>>> meta > > >>>>>>>>>>>>>>>>>>>> level, we also created several sub-kips > > >>>> (linked > > >>>>>>> in > > >>>>>>>>>>>> KIP-1150) > > >>>>>>>>>>>>>> that > > >>>>>>>>>>>>>>>>> offer > > >>>>>>>>>>>>>>>>>>> an > > >>>>>>>>>>>>>>>>>>>> implementation of this feature. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> We kindly ask you to use the proper DISCUSS > > >>>>>>> threads > > >>>>>>>>> for > > >>>>>>>>>>>> each > > >>>>>>>>>>>>>>> type of > > >>>>>>>>>>>>>>>>>>>> concern and keep this one to discuss > > >>>> whether > > >>>>>>> Apache > > >>>>>>>>> Kafka > > >>>>>>>>>>>> wants > > >>>>>>>>>>>>>>> to > > >>>>>>>>>>>>>>>>> have > > >>>>>>>>>>>>>>>>>>>> this feature or not. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks in advance on behalf of all the > > >>>> authors > > >>>>>> of > > >>>>>>>>> this KIP. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> ------------------ > > >>>>>>>>>>>>>>>>>>>> Josep Prat > > >>>>>>>>>>>>>>>>>>>> Open Source Engineering Director, Aiven > > >>>>>>>>>>>>>>>>>>>> [email protected] | +491715557497 | > > >>>>>>> aiven.io > > >>>>>>>>>>>>>>>>>>>> Aiven Deutschland GmbH > > >>>>>>>>>>>>>>>>>>>> Alexanderufer 3-7, 10117 Berlin > > >>>>>>>>>>>>>>>>>>>> Geschäftsführer: Oskari Saarenmaa, Hannu > > >>>>>>> Valtonen, > > >>>>>>>>>>>>>>>>>>>> Anna Richardson, Kenneth Chen > > >>>>>>>>>>>>>>>>>>>> Amtsgericht Charlottenburg, HRB 209739 B > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >> > > >> > > > > >
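
To make the revised commit path described at the top of this thread a bit more concrete: batches in a shared WAL file are addressed by (file name, byte offset, size), and the logical per-partition commits are grouped into one physical request per broker, so fan-out is capped by the broker count rather than the partition count. Below is a minimal, illustrative sketch of that grouping step; all names (BatchRef, CommitRequest, WalCommitPlanner) are hypothetical and not taken from KIP-1150/1163.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical types for illustration only -- not part of the KIPs.
record BatchRef(String topic, int partition, long walByteOffset, int sizeBytes) {}
record CommitRequest(int brokerId, String walFileName, List<BatchRef> batches) {}

public class WalCommitPlanner {

    /** Maps "topic-partition" to the broker currently hosting its leader. */
    private final Map<String, Integer> leaderBrokerByPartition;

    public WalCommitPlanner(Map<String, Integer> leaderBrokerByPartition) {
        this.leaderBrokerByPartition = leaderBrokerByPartition;
    }

    /**
     * Groups the logical per-partition commits of one shared WAL file into
     * one physical request per broker, capping fan-out at the broker count.
     */
    public List<CommitRequest> plan(String walFileName, List<BatchRef> batchesInFile) {
        Map<Integer, List<BatchRef>> byBroker = batchesInFile.stream()
            .collect(Collectors.groupingBy(
                b -> leaderBrokerByPartition.get(b.topic() + "-" + b.partition())));
        return byBroker.entrySet().stream()
            .map(e -> new CommitRequest(e.getKey(), walFileName, e.getValue()))
            .toList();
    }

    public static void main(String[] args) {
        Map<String, Integer> leaders = Map.of("orders-0", 1, "orders-1", 2, "payments-0", 1);
        WalCommitPlanner planner = new WalCommitPlanner(leaders);
        List<BatchRef> batches = List.of(
            new BatchRef("orders", 0, 0L, 4096),
            new BatchRef("orders", 1, 4096L, 2048),
            new BatchRef("payments", 0, 6144L, 1024));
        // Three partitions, but only two physical commit requests (brokers 1 and 2).
        planner.plan("wal-000123.log", batches).forEach(System.out::println);
    }
}
```

Running the main method with three partitions whose leaders sit on two brokers yields two physical commit requests, which is the property the design relies on; AZ-aware metadata would ideally bring that number down to one per produce.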

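Similarly, for the transactional flow discussed above (the transaction coordinator telling the batch coordinator when transactions start and finish, brokers caching a positive "transaction ongoing" check keyed by producer ID and epoch, and control batches existing only at the metadata level), here is a rough sketch of the bookkeeping involved. It assumes transaction version 2 so that (producer ID, epoch) uniquely identifies a transaction, per Justine's point; the class and method names are made up for illustration and are not from the KIPs.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of batch-coordinator transaction bookkeeping; illustrative only.
public class BatchCoordinatorTxnState {

    // Assumes transaction version 2, where (producerId, epoch) is unique per transaction.
    record TxnKey(long producerId, short producerEpoch) {}
    record OngoingTxn(Instant startedAt) {}

    private final Map<TxnKey, OngoingTxn> ongoing = new ConcurrentHashMap<>();
    private final Duration maxTxnDuration;

    public BatchCoordinatorTxnState(Duration maxTxnDuration) {
        this.maxTxnDuration = maxTxnDuration;
    }

    /** Called when the transaction coordinator reports that a transaction has started. */
    public void onTransactionStarted(long producerId, short epoch) {
        ongoing.put(new TxnKey(producerId, epoch), new OngoingTxn(Instant.now()));
    }

    /**
     * The check brokers issue for a partition's first transactional batch.
     * A positive answer can be cached broker-side, roughly for twice the
     * configured transaction duration, to avoid repeated round trips.
     */
    public boolean isTransactionOngoing(long producerId, short epoch) {
        return ongoing.containsKey(new TxnKey(producerId, epoch));
    }

    /**
     * Called when the transaction coordinator reports commit or abort. The
     * corresponding control batches are recorded only at the metadata level;
     * brokers synthesize them when rebuilding local segments for fetches.
     */
    public Optional<Duration> onTransactionFinished(long producerId, short epoch) {
        OngoingTxn txn = ongoing.remove(new TxnKey(producerId, epoch));
        return Optional.ofNullable(txn)
            .map(t -> Duration.between(t.startedAt(), Instant.now()));
    }

    /** Suggested TTL for the broker-side cache of positive ongoing-transaction checks. */
    public Duration cacheTtlHint() {
        return maxTxnDuration.multipliedBy(2);
    }
}
```

Aborted-transaction metadata would live alongside this state for as long as the corresponding offsets are addressable, as Ivan and Justine discussed, so consumer fetches can still be served the aborted-transaction information they need.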