Thanks Josep, let me move my questions and proposed design into the proper thread.
Regards,
Viquar Khan

On Mon, 2 Mar 2026 at 01:59, Josep Prat <[email protected]> wrote:

> Hi Viquar,
>
> Thanks for your comments and for participating in the KIP process. In
> order for your comments to be registered properly, you have to use the
> proper DISCUSS threads for each KIP. This way, we have a singular
> centralized archive for discussions and votes.
> For KIP-1164, you can find the existing DISCUSS thread here [1]. The
> detailed process for Kafka Improvement Proposals is also available for
> reference [2].
>
> [1] https://lists.apache.org/thread/m9l6lbqv2cffxtz5frypylmqjd7bsqoz
> [2] https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#KafkaImprovementProposals-Process
>
> Best,
>
> Josep Prat,
> PMC Member of Apache Kafka.
>
> On 2026/03/01 20:26:24 vaquar khan wrote:
> > Hi Everyone,
> >
> > Following up on the KIP-1150 vote thread, I'm moving my questions
> > regarding Exactly-Once Semantics (EOS) over here since they fall
> > squarely into KIP-1164's domain.
> >
> > Decoupling storage is a massive win for cross-AZ costs, but shifting
> > to a leaderless data plane inherently decentralizes the transaction
> > state machine. To ensure we don't accidentally introduce split-brain
> > scenarios or break read_committed isolation, we need to explicitly
> > define the synchronization barriers.
> >
> > Here are three areas where the current design needs tighter specs,
> > along with some proposed architectural patterns to solve them:
> >
> > 1. LSO Calculation via Materialized Views: In standard Kafka, the
> > partition leader is the single source of truth. It tracks in-flight
> > transactions via the ProducerStateManager and computes the Last
> > Stable Offset (LSO) in memory. With diskless, the Batch Coordinator
> > takes over this role.
> >
> > The Gap: If the Batch Coordinator is handling LSO for a huge number
> > of multiplexed partitions, it risks becoming a severe bottleneck.
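[Editor's sketch of the projection idea proposed for point 1 below. All class and method names are assumed for illustration; nothing here is defined by the KIP. The coordinator replays the metadata event stream through the event handlers, and the LSO is then answered from the head of a small in-memory index rather than by scanning transaction logs.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical materialized view over the _diskless-metadata event stream.
// It tracks the first offset of each open transaction per partition, so the
// LSO is the earliest open transaction's first offset (or the high watermark
// when no transaction is open).
class LsoProjection {
    // producerId -> first offset of that producer's open transaction
    private final Map<Long, Long> openTxnFirstOffset = new HashMap<>();
    // first offset -> count of open transactions starting at that offset
    private final TreeMap<Long, Integer> openByOffset = new TreeMap<>();

    // Event: a producer's transaction first touched this partition at firstOffset.
    void onTxnStart(long producerId, long firstOffset) {
        if (openTxnFirstOffset.putIfAbsent(producerId, firstOffset) == null) {
            openByOffset.merge(firstOffset, 1, Integer::sum);
        }
    }

    // Event: a commit or abort marker was recorded for this producer.
    void onTxnEnd(long producerId) {
        Long first = openTxnFirstOffset.remove(producerId);
        if (first != null) {
            // A null remapping result removes the entry once the count hits zero.
            openByOffset.merge(first, -1, (a, b) -> a + b == 0 ? null : a + b);
        }
    }

    // LSO = earliest open transaction's first offset, else the high watermark.
    long lastStableOffset(long highWatermark) {
        return openByOffset.isEmpty() ? highWatermark : openByOffset.firstKey();
    }
}
```

Because the view is derived purely from replaying the event stream, it can be rebuilt from scratch on coordinator failover, which is what makes the event-store framing attractive here.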
> >
> > Proposed Design: I recommend we explicitly frame the
> > _diskless-metadata topic as an immutable Event Store. The
> > coordinator's embedded SQLite database should act purely as a
> > materialized view (projection) over this event stream. This
> > projection would maintain a continuously updated index of active
> > PIDs, allowing us to dynamically resolve the LSO in O(1) time
> > without requiring the coordinator to scan unbounded transaction
> > logs.
> >
> > 2. Cross-Coordinator RPC & The Commit Barrier: When the Transaction
> > Coordinator (TC) decides to commit, it needs to verify that all data
> > batches for that transaction epoch are actually in place and
> > sequenced.
> >
> > The Gap: The KIP currently lacks a defined RPC handshake between the
> > TC and the Batch Coordinator. What happens if a
> > CommitBatchCoordinates call is still in flight when the TC tries to
> > write the commit marker?
> >
> > Proposed Design: We need to explicitly document a strict "Commit
> > Barrier." Before writing the commit marker, the Batch Coordinator
> > must deterministically verify it has received contiguous sequence
> > numbers for the whole epoch. If there are pending asynchronous
> > payloads, the commit marker must be blocked at this barrier until
> > they resolve or definitively time out.
> >
> > 3. The Zombie Broker Problem & Fencing Tokens: This is the edge case
> > that worries me the most. Consider this scenario: a broker uploads a
> > batch to S3, but then gets hit with a severe GC pause before it can
> > send the metadata commit to the Batch Coordinator. Meanwhile, the
> > transaction times out and the TC rolls the epoch forward.
> >
> > The Gap: When the broker finally wakes up, it sends its delayed
> > metadata commit. If the Batch Coordinator accepts it, we've just
> > merged stale data into a transaction that's already been marked as
> > aborted or committed: a direct EOS violation.
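[Editor's sketch of the fencing check that would close point 3's gap. The KIP does not define this validation, so the class name, the per-broker epoch map, and the RPC shape are all assumptions; the point is only that the check is deterministic, not timeout-based.]

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical fencing logic at the Batch Coordinator. It remembers the
// latest broker epoch observed from cluster metadata, and a metadata commit
// is accepted only if it carries a current epoch. A zombie broker waking up
// after a GC pause still holds its old epoch, so its delayed commit is
// rejected rather than merged into a finished transaction.
class BatchCoordinatorFence {
    private final ConcurrentHashMap<Integer, Long> latestBrokerEpoch =
            new ConcurrentHashMap<>();

    // Cluster metadata update: the broker re-registered with a newer epoch.
    // Math::max keeps the epoch monotonic even if updates arrive out of order.
    void onBrokerEpochBump(int brokerId, long epoch) {
        latestBrokerEpoch.merge(brokerId, epoch, Math::max);
    }

    // Returns true iff the commit carries a current epoch; stale epochs
    // (and unknown brokers) are fenced.
    boolean tryCommit(int brokerId, long brokerEpoch) {
        Long current = latestBrokerEpoch.get(brokerId);
        return current != null && brokerEpoch >= current;
    }
}
```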
> >
> > Proposed Design: Probabilistic timeouts won't fix this; we need
> > deterministic correctness. Every metadata commit should include a
> > monotonic BrokerEpoch acting as a fencing token. The Batch
> > Coordinator must validate this token against the latest known
> > cluster state and immediately reject anything from a stale epoch.
> >
> > Locking down these public interfaces and state transitions in the
> > text will give the community the confidence needed to implement this
> > safely.
> >
> > Happy to dig into the code or discuss further if it helps clarify
> > any of this. Looking forward to hearing your thoughts on how we
> > handle these synchronization barriers.
> >
> > Regards,
> > Viquar Khan
> > *Linkedin* - https://www.linkedin.com/in/vaquar-khan-b695577/
> > *Book* - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
> > *GitBook* - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
> > *Stack* - https://stackoverflow.com/users/4812170/vaquar-khan
> > *github* - https://github.com/vaquarkhan
