Hi Justine and all,

Thank you for your questions!

> JO 1. >Since a transaction could be uniquely identified with producer ID
> and epoch, the positive result of this check could be cached locally
> Are we saying that only new transaction version 2 transactions can be used
> here? If not, we can't uniquely identify transactions with producer id +
> epoch

You’re right that we (probably unintentionally) focused only on transaction 
version 2. We can either limit the support to version 2, or consider using some 
surrogate identifiers to also support version 1.
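
To illustrate the caching idea, here is a rough Java sketch (not real Kafka 
code; the class and names are made up), assuming transaction version 2, where 
the epoch is bumped on every transaction:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: with transaction version 2, (producer ID, epoch)
    // changes on every transaction, so together with the partition it can key
    // the cached "this partition is already in the transaction" result.
    class VerifiedTxnPartitionCache {
        record Key(long producerId, short producerEpoch, String topic, int partition) { }

        private final Map<Key, Long> verifiedUntilMs = new ConcurrentHashMap<>();

        // Cache a positive answer from the batch coordinator for twice the
        // configured maximum transaction duration.
        void put(Key key, long nowMs, long txnMaxTimeoutMs) {
            verifiedUntilMs.put(key, nowMs + 2 * txnMaxTimeoutMs);
        }

        boolean isVerified(Key key, long nowMs) {
            Long deadline = verifiedUntilMs.get(key);
            return deadline != null && nowMs < deadline;
        }
    }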

> JO 2. >The batch coordinator does the final transactional checks of the
> batches. This procedure would output the same errors like the partition
> leader in classic topics would do.
> Can you expand on what these checks are? Would you be checking if the
> transaction was still ongoing for example?* *

Yes: the producer epoch check, the check that the transaction is ongoing, and 
of course the normal idempotence checks. In other words, what the partition 
leader in classic topics does before appending a batch to the local log (e.g. in 
UnifiedLog.maybeStartTransactionVerification and 
UnifiedLog.analyzeAndValidateProducerState). In Diskless, we unfortunately 
cannot do these checks before appending the data to the WAL segment and 
uploading it, but we can “tombstone” the offending batches in the batch 
coordinator during the final commit.
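
To make this concrete, here is a rough pseudocode-level Java sketch of what the 
batch coordinator could do during the final commit of a WAL segment. None of 
the batch coordinator classes or methods below exist; only the error codes are 
the real ones from org.apache.kafka.common.protocol.Errors:

    // Hypothetical sketch: validation of transactional/idempotent batches when
    // a WAL segment is committed. Rejected batches get no offsets assigned and
    // remain as garbage ("tombstoned") in the already uploaded object.
    for (UploadedBatch batch : walSegment.batches()) {
        ProducerStateSnapshot state = producerState(batch.producerId());

        if (batch.producerEpoch() < state.currentEpoch()) {
            reject(batch, Errors.INVALID_PRODUCER_EPOCH);
        } else if (batch.isTransactional()
                && !state.hasOngoingTransaction(batch.topicPartition())) {
            reject(batch, Errors.INVALID_TXN_STATE);
        } else if (!state.isNextExpectedSequence(batch.topicPartition(),
                                                 batch.baseSequence())) {
            // the normal idempotence check
            reject(batch, Errors.OUT_OF_ORDER_SEQUENCE_NUMBER);
        } else {
            assignOffsetsAndCommit(batch);
        }
    }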
 
> Is there state about ongoing
> transactions in the batch coordinator? I see some other state mentioned in
> the End transaction section, but it's not super clear what state is stored
> and when it is stored.

Right, this should have been more explicit. Just as the partition leader tracks 
ongoing transactions for classic topics, the batch coordinator has to track them 
as well. So when a transaction starts and ends, the transaction coordinator must 
inform the batch coordinator about it.
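
For illustration only (not a proposed schema), the per-transaction state kept 
by the batch coordinator could look roughly like this, mirroring what 
ProducerStateManager tracks on a classic partition leader:

    import java.util.Map;
    import org.apache.kafka.common.TopicPartition;

    // Hypothetical sketch of the batch coordinator's per-transaction state.
    record OngoingTransaction(
        long producerId,
        short producerEpoch,
        // first offset of the transaction in each involved partition;
        // also the input for computing the LSO (see JO 3 below)
        Map<TopicPartition, Long> firstOffsetPerPartition,
        long startTimestampMs) { }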

> JO 3. I didn't see anything about maintaining LSO -- perhaps that would be
> stored in the batch coordinator?

Yes. The LSO could be deduced from the committed batches and other information, 
but for performance we would rather store it explicitly.
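
Conceptually (again only a sketch, reusing the hypothetical OngoingTransaction 
state from above), the LSO of a Diskless partition would be:

    import java.util.Collection;
    import java.util.Objects;
    import org.apache.kafka.common.TopicPartition;

    // Hypothetical sketch: the LSO is the first offset of the earliest still
    // ongoing transaction in the partition, or the high watermark if there is
    // none. Storing it explicitly avoids recomputing this on every Fetch.
    long lastStableOffset(TopicPartition tp, long highWatermark,
                          Collection<OngoingTransaction> ongoing) {
        return ongoing.stream()
            .map(txn -> txn.firstOffsetPerPartition().get(tp))
            .filter(Objects::nonNull)
            .mapToLong(Long::longValue)
            .min()
            .orElse(highWatermark);
    }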

> JO 4. Are there any thoughts about how long transactional state is
> maintained in the batch coordinator and how it will be cleaned up?

As we understand it, the partition leader in classic topics forgets about a 
transaction once it’s replicated (the high watermark passes it). The transaction 
coordinator acts as the main guardian, allowing partition leaders to do this 
safely. Please correct me if this is wrong. We are thinking of relying on the 
same principle with the batch coordinator and deleting the information about a 
transaction once it’s finished (as there’s no replication and the high watermark 
advances immediately).
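
In code terms, with the same hypothetical state as above, the cleanup on the 
EndTxn path could be as simple as:

    // Hypothetical sketch: when the transaction coordinator reports that a
    // transaction is finished, the batch coordinator records the control batch
    // positions in metadata, updates the LSOs, and immediately drops the
    // per-transaction state, since there is no replication to wait for.
    void onTransactionFinished(long producerId, short producerEpoch, boolean committed) {
        OngoingTransaction txn =
            ongoingTransactions.remove(new TxnKey(producerId, producerEpoch));
        if (txn == null) {
            return;  // already cleaned up; the request is a retry
        }
        for (TopicPartition tp : txn.firstOffsetPerPartition().keySet()) {
            insertControlBatchMetadata(tp, producerId, producerEpoch, committed);
            recomputeLastStableOffset(tp);
        }
    }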

Best,
Ivan

On Tue, Sep 9, 2025, at 00:38, Justine Olshan wrote:
> Hey folks,
> 
> Excited to see some updates related to transactions!
> 
> I had a few questions.
> 
> JO 1. >Since a transaction could be uniquely identified with producer ID
> and epoch, the positive result of this check could be cached locally
> Are we saying that only new transaction version 2 transactions can be used
> here? If not, we can't uniquely identify transactions with producer id +
> epoch
> 
> JO 2. >The batch coordinator does the final transactional checks of the
> batches. This procedure would output the same errors like the partition
> leader in classic topics would do.
> Can you expand on what these checks are? Would you be checking if the
> transaction was still ongoing for example? Is there state about ongoing
> transactions in the batch coordinator? I see some other state mentioned in
> the End transaction section, but it's not super clear what state is stored
> and when it is stored.
> 
> JO 3. I didn't see anything about maintaining LSO -- perhaps that would be
> stored in the batch coordinator?
> 
> JO 4. Are there any thoughts about how long transactional state is
> maintained in the batch coordinator and how it will be cleaned up?
> 
> On Mon, Sep 8, 2025 at 10:38 AM Jun Rao <j...@confluent.io.invalid> wrote:
> 
> > Hi, Greg and Ivan,
> >
> > Thanks for the update. A few comments.
> >
> > JR 10. "Consumer fetches are now served from local segments, making use of
> > the
> > indexes, page cache, request purgatory, and zero-copy functionality already
> > built into classic topics."
> > JR 10.1 Does the broker build the producer state for each partition in
> > diskless topics?
> > JR 10.2 For transactional data, the consumer fetches need to know aborted
> > records. How is that achieved?
> >
> > JR 11. "The batch coordinator saves that the transaction is finished and
> > also inserts the control batches in the corresponding logs of the involved
> > Diskless topics. This happens only on the metadata level, no actual control
> > batches are written to any file. "
> > A fetch response could include multiple transactional batches. How does the
> > broker obtain the information about the ending control batch for each
> > batch? Does that mean that a fetch response needs to be built by
> > stitching record batches and generated control batches together?
> >
> > JR 12. Queues: Is there still a share partition leader that all consumers
> > are routed to?
> >
> > JR 13. "Should the KIPs be modified to include this or it's too
> > implementation-focused?" It would be useful to include enough details to
> > understand correctness and performance impact.
> >
> > HC5. Henry has a valid point. Requests from a given producer contain a
> > sequence number, which is ordered. If a producer sends every Produce
> > request to an arbitrary broker, those requests could reach the batch
> > coordinator in different order and lead to rejection of the produce
> > requests.
> >
> > Jun
> >
> > On Thu, Sep 4, 2025 at 12:00 AM Ivan Yurchenko <i...@ivanyu.me> wrote:
> >
> > > Hi all,
> > >
> > > We have also thought in a bit more details about transactions and queues,
> > > here's the plan.
> > >
> > > *Transactions*
> > >
> > > The support for transactions in *classic topics* is based on precise
> > > interactions between three actors: clients (mostly producers, but also
> > > consumers), brokers (ReplicaManager and other classes), and transaction
> > > coordinators. Brokers also run partition leaders with their local state
> > > (ProducerStateManager and others).
> > >
> > > The high level (some details skipped) workflow is the following. When a
> > > transactional Produce request is received by the broker:
> > > 1. For each partition, the partition leader checks if a non-empty
> > > transaction is running for this partition. This is done using its local
> > > state derived from the log metadata (ProducerStateManager,
> > > VerificationStateEntry, VerificationGuard).
> > > 2. The transaction coordinator is informed about all the partitions that
> > > aren’t part of the transaction to include them.
> > > 3. The partition leaders do additional transactional checks.
> > > 4. The partition leaders append the transactional data to their logs and
> > > update some of their state (for example, log the fact that the
> > transaction
> > > is running for the partition and its first offset).
> > >
> > > When the transaction is committed or aborted:
> > > 1. The producer contacts the transaction coordinator directly with
> > > EndTxnRequest.
> > > 2. The transaction coordinator writes PREPARE_COMMIT or PREPARE_ABORT to
> > > its log and responds to the producer.
> > > 3. The transaction coordinator sends WriteTxnMarkersRequest to the
> > leaders
> > > of the involved partitions.
> > > 4. The partition leaders write the transaction markers to their logs and
> > > respond to the coordinator.
> > > 5. The coordinator writes the final transaction state COMPLETE_COMMIT or
> > > COMPLETE_ABORT.
> > >
> > > In classic topics, partitions have leaders and lots of important state
> > > necessary for supporting this workflow is local. The main challenge in
> > > mapping this to Diskless comes from the fact there are no partition
> > > leaders, so the corresponding pieces of state need to be globalized in
> > the
> > > batch coordinator. We are already doing this to support idempotent
> > produce.
> > >
> > > The high level workflow for *diskless topics* would look very similar:
> > > 1. For each partition, the broker checks if a non-empty transaction is
> > > running for this partition. In contrast to classic topics, this is
> > checked
> > > against the batch coordinator with a single RPC. Since a transaction
> > could
> > > be uniquely identified with producer ID and epoch, the positive result of
> > > this check could be cached locally (for the double configured duration
> > of a
> > > transaction, for example).
> > > 2. The same: The transaction coordinator is informed about all the
> > > partitions that aren’t part of the transaction to include them.
> > > 3. No transactional checks are done on the broker side.
> > > 4. The broker appends the transactional data to the current shared WAL
> > > segment. It doesn’t update any transaction-related state for Diskless
> > > topics, because it doesn’t have any.
> > > 5. The WAL segment is committed to the batch coordinator like in the
> > > normal produce flow.
> > > 6. The batch coordinator does the final transactional checks of the
> > > batches. This procedure would output the same errors like the partition
> > > leader in classic topics would do. I.e. some batches could be rejected.
> > > This means, there will potentially be garbage in the WAL segment file in
> > > case of transactional errors. This is preferable to doing more network
> > > round trips, especially considering the WAL segments will be relatively
> > > short-living (see the Greg's update above).
> > >
> > > When the transaction is committed or aborted:
> > > 1. The producer contacts the transaction coordinator directly with
> > > EndTxnRequest.
> > > 2. The transaction coordinator writes PREPARE_COMMIT or PREPARE_ABORT to
> > > its log and responds to the producer.
> > > 3. *[NEW]* The transaction coordinator informs the batch coordinator that
> > > the transaction is finished.
> > > 4. *[NEW]* The batch coordinator saves that the transaction is finished
> > > and also inserts the control batches in the corresponding logs of the
> > > involved Diskless topics. This happens only on the metadata level, no
> > > actual control batches are written to any file. They will be dynamically
> > > created on Fetch and other read operations. We could technically write
> > > these control batches for real, but this would mean extra produce
> > latency,
> > > so it's better just to mark them in the batch coordinator and save these
> > > milliseconds.
> > > 5. The transaction coordinator sends WriteTxnMarkersRequest to the
> > leaders
> > > of the involved partitions. – Now only to classic topics now.
> > > 6. The partition leaders of classic topics write the transaction markers
> > > to their logs and respond to the coordinator.
> > > 7. The coordinator writes the final transaction state COMPLETE_COMMIT or
> > > COMPLETE_ABORT.
> > >
> > > Compared to the non-transactional produce flow, we get:
> > > 1. An extra network round trip between brokers and the batch coordinator
> > > when a new partition appear in the transaction. To mitigate the impact of
> > > them:
> > >   - The results will be cached.
> > >   - The calls for multiple partitions in one Produce request will be
> > > grouped.
> > >   - The batch coordinator should be optimized for fast response to these
> > > RPCs.
> > >   - The fact that a single producer normally will communicate with a
> > > single broker for the duration of the transaction further reduces the
> > > expected number of round trips.
> > > 2. An extra round trip between the transaction coordinator and batch
> > > coordinator when a transaction is finished.
> > >
> > > With this proposal, transactions will also be able to span both classic
> > > and Diskless topics.
> > >
> > > *Queues*
> > >
> > > The share group coordination and management is a side job that doesn't
> > > interfere with the topic itself (leadership, replicas, physical storage
> > of
> > > records, etc.) and non-queue producers and consumers (Fetch and Produce
> > > RPCs, consumer group-related RPCs are not affected.) We don't see any
> > > reason why we can't make Diskless topics compatible with share groups the
> > > same way as classic topics are. Even on the code level, we don't expect
> > any
> > > serious refactoring: the same reading routines are used that are used for
> > > fetching (e.g. ReplicaManager.readFromLog).
> > >
> > >
> > > Should the KIPs be modified to include this or it's too
> > > implementation-focused?
> > >
> > > Best regards,
> > > Ivan
> > >
> > > On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote:
> > > > Hi all,
> > > >
> > > > Thank you all for your questions and design input on KIP-1150.
> > > >
> > > > We have just updated KIP-1150 and KIP-1163 with a new design. To
> > > summarize
> > > > the changes:
> > > >
> > > > 1. The design prioritizes integrating with the existing KIP-405 Tiered
> > > > Storage interfaces, permitting data produced to a Diskless topic to be
> > > > moved to tiered storage.
> > > > This lowers the scalability requirements for the Batch Coordinator
> > > > component, and allows Diskless to compose with Tiered Storage plugin
> > > > features such as encryption and alternative data formats.
> > > >
> > > > 2. Consumer fetches are now served from local segments, making use of
> > the
> > > > indexes, page cache, request purgatory, and zero-copy functionality
> > > already
> > > > built into classic topics.
> > > > However, local segments are now considered cache elements, do not need
> > to
> > > > be durably stored, and can be built without contacting any other
> > > replicas.
> > > >
> > > > 3. The design has been simplified substantially, by removing the
> > previous
> > > > Diskless consume flow, distributed cache component, and "object
> > > > compaction/merging" step.
> > > >
> > > > The design maintains leaderless produces as enabled by the Batch
> > > > Coordinator, and the same latency profiles as the earlier design, while
> > > > being simpler and integrating better into the existing ecosystem.
> > > >
> > > > Thanks, and we are eager to hear your feedback on the new design.
> > > > Greg Harris
> > > >
> > > > On Mon, Jul 21, 2025 at 3:30 PM Jun Rao <j...@confluent.io.invalid>
> > > wrote:
> > > >
> > > > > Hi, Jan,
> > > > >
> > > > > For me, the main gap of KIP-1150 is the support of all existing
> > client
> > > > > APIs. Currently, there is no design for supporting APIs like
> > > transactions
> > > > > and queues.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski
> > > > > <jan.siekier...@kentra.io.invalid> wrote:
> > > > >
> > > > > > Would it be a good time to ask for the current status of this KIP?
> > I
> > > > > > haven't seen much activity here for the past 2 months, the vote got
> > > > > vetoed
> > > > > > but I think the pending questions have been answered since then.
> > > KIP-1183
> > > > > > (AutoMQ's proposal) also didn't have any activity since May.
> > > > > >
> > > > > > In my eyes KIP-1150 and KIP-1183 are two real choices that can be
> > > > > > made, with a coordinator-based approach being by far the dominant
> > one
> > > > > when
> > > > > > it comes to market adoption - but all these are standalone
> > products.
> > > > > >
> > > > > > I'm a big fan of both approaches, but would hate to see a stall. So
> > > the
> > > > > > question is: can we get an update?
> > > > > >
> > > > > > Maybe it's time to start another vote? Colin McCabe - have your
> > > questions
> > > > > > been answered? If not, is there anything I can do to help? I'm
> > deeply
> > > > > > familiar with both architectures and have written about both?
> > > > > >
> > > > > > Kind regards,
> > > > > > Jan
> > > > > >
> > > > > > On Tue, Jun 24, 2025 at 10:42 AM Stanislav Kozlovski <
> > > > > > stanislavkozlov...@apache.org> wrote:
> > > > > >
> > > > > > > I have some nits - it may be useful to
> > > > > > >
> > > > > > > a) group all the KIP email threads in the main one (just a bunch
> > of
> > > > > links
> > > > > > > to everything)
> > > > > > > b) create the email threads
> > > > > > >
> > > > > > > It's a bit hard to track it all - for example, I was searching
> > for
> > > a
> > > > > > > discuss thread for KIP-1165 for a while; As far as I can tell, it
> > > > > doesn't
> > > > > > > exist yet.
> > > > > > >
> > > > > > > Since the KIPs are published (by virtue of having the root KIP be
> > > > > > > published, having a DISCUSS thread and links to sub-KIPs where
> > were
> > > > > aimed
> > > > > > > to move the discussion towards), I think it would be good to
> > create
> > > > > > DISCUSS
> > > > > > > threads for them all.
> > > > > > >
> > > > > > > Best,
> > > > > > > Stan
> > > > > > >
> > > > > > > On 2025/04/16 11:58:22 Josep Prat wrote:
> > > > > > > > Hi Kafka Devs!
> > > > > > > >
> > > > > > > > We want to start a new KIP discussion about introducing a new
> > > type of
> > > > > > > > topics that would make use of Object Storage as the primary
> > > source of
> > > > > > > > storage. However, as this KIP is big we decided to split it
> > into
> > > > > > multiple
> > > > > > > > related KIPs.
> > > > > > > > We have the motivational KIP-1150 (
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > > > > > )
> > > > > > > > that aims to discuss if Apache Kafka should aim to have this
> > > type of
> > > > > > > > feature at all. This KIP doesn't go onto details on how to
> > > implement
> > > > > > it.
> > > > > > > > This follows the same approach used when we discussed KRaft.
> > > > > > > >
> > > > > > > > But as we know that it is sometimes really hard to discuss on
> > > that
> > > > > meta
> > > > > > > > level, we also created several sub-kips (linked in KIP-1150)
> > that
> > > > > offer
> > > > > > > an
> > > > > > > > implementation of this feature.
> > > > > > > >
> > > > > > > > We kindly ask you to use the proper DISCUSS threads for each
> > > type of
> > > > > > > > concern and keep this one to discuss whether Apache Kafka wants
> > > to
> > > > > have
> > > > > > > > this feature or not.
> > > > > > > >
> > > > > > > > Thanks in advance on behalf of all the authors of this KIP.
> > > > > > > >
> > > > > > > > ------------------
> > > > > > > > Josep Prat
> > > > > > > > Open Source Engineering Director, Aiven
> > > > > > > > josep.p...@aiven.io   |   +491715557497 | aiven.io
> > > > > > > > Aiven Deutschland GmbH
> > > > > > > > Alexanderufer 3-7, 10117 Berlin
> > > > > > > > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> > > > > > > > Anna Richardson, Kenneth Chen
> > > > > > > > Amtsgericht Charlottenburg, HRB 209739 B
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 
