Hi Everyone,

Following up on the KIP-1150 vote thread, I'm moving my questions regarding
Exactly-Once Semantics (EOS) over here since they fall squarely into
KIP-1164's domain.

Decoupling storage is a massive win for cross-AZ costs, but shifting to a
leaderless data plane inherently decentralizes the transaction state
machine. To ensure we don't accidentally introduce split-brain scenarios or
break read_committed isolation, we need to explicitly define the
synchronization barriers.

Here are three areas where the current design needs tighter specs, along
with some proposed architectural patterns to solve them:

1. LSO Calculation via Materialized Views: In standard Kafka, the partition
leader is the single source of truth. It tracks in-flight transactions via
the ProducerStateManager and computes the Last Stable Offset (LSO) in
memory. With diskless, the Batch Coordinator takes over this role.

The Gap: If the Batch Coordinator is handling LSO for a huge number of
multiplexed partitions, it risks becoming a severe bottleneck.

Proposed Design: I recommend we explicitly frame the diskless-metadata
topic as an immutable Event Store. The coordinator's embedded SQLite
database should act purely as a materialized view (projection) over this
event stream. This projection would maintain a continuously updated index
of active PIDs, allowing us to dynamically resolve the LSO in O(1) time
without requiring the coordinator to scan unbounded transaction logs.
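To make the projection idea concrete, here is a minimal sketch of how the
coordinator could maintain that index. All event names (txn begin/end, batch
committed) and the data layout are my assumptions for illustration, not
anything specified in the KIP; the point is only that the LSO becomes a
cheap lookup over a continuously maintained view instead of a log scan:

```python
import heapq

class LsoProjection:
    """Materialized view over the diskless metadata event stream (sketch)."""

    def __init__(self):
        self.open_txn_offsets = {}   # PID -> first offset of its open txn
        self.heap = []               # min-heap of (first_offset, pid)
        self.high_watermark = 0

    def on_txn_begin(self, pid, first_offset):
        self.open_txn_offsets[pid] = first_offset
        heapq.heappush(self.heap, (first_offset, pid))

    def on_txn_end(self, pid):
        # Commit or abort marker observed: this PID no longer holds back
        # the LSO. The stale heap entry is discarded lazily in lso().
        self.open_txn_offsets.pop(pid, None)

    def on_batch_committed(self, last_offset):
        self.high_watermark = max(self.high_watermark, last_offset + 1)

    def lso(self):
        # The earliest offset still held by an open transaction bounds the
        # LSO; with no open transactions, LSO == high watermark.
        while self.heap:
            first_offset, pid = self.heap[0]
            if self.open_txn_offsets.get(pid) == first_offset:
                return first_offset
            heapq.heappop(self.heap)   # stale entry, drop it
        return self.high_watermark
```

The heap plus lazy deletion keeps each LSO query amortized near-constant
even with many multiplexed partitions, which is the property the proposal
depends on.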

2. Cross-Coordinator RPC & The Commit Barrier: When the Transaction
Coordinator (TC) decides to commit, it needs to verify that all data
batches for that transaction epoch are actually in place and sequenced.

The Gap: The KIP currently lacks a defined RPC handshake between the TC and
the Batch Coordinator. What happens if a CommitBatchCoordinates call is
still in flight when the TC tries to write the commit marker?

Proposed Design: We need to explicitly document a strict "Commit Barrier."
Before writing the commit marker, the Batch Coordinator must
deterministically verify it has received contiguous sequence numbers for
the whole epoch. If there are pending asynchronous payloads, the commit
marker must be blocked at this barrier until they resolve or definitively
time out.
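The barrier check itself can be stated very simply. The sketch below is my
own shorthand for the invariant, not a proposed RPC schema; the function and
field names are assumptions:

```python
def commit_barrier_satisfied(recorded_seqs, first_seq, last_seq):
    """True iff every sequence number in [first_seq, last_seq] has been
    recorded by the Batch Coordinator for this producer epoch."""
    expected = set(range(first_seq, last_seq + 1))
    return expected.issubset(recorded_seqs)

def try_write_commit_marker(recorded_seqs, first_seq, last_seq):
    # The TC must not write the marker while any CommitBatchCoordinates
    # call for the epoch is still in flight; it blocks or retries with a
    # deadline until the barrier holds (or the transaction times out).
    if not commit_barrier_satisfied(recorded_seqs, first_seq, last_seq):
        return "PENDING"
    return "MARKER_WRITTEN"
```

Whatever the final RPC shape looks like, documenting this contiguity
invariant explicitly is what prevents a marker from ever sequencing ahead
of its own data.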

3. The Zombie Broker Problem & Fencing Tokens: This is the edge case that
worries me the most. Consider this scenario: a broker uploads a batch to S3,
but then hits a severe GC pause before it can send the metadata commit to
the Batch Coordinator. Meanwhile, the transaction times out and the TC
rolls the epoch forward.

The Gap: When the broker finally wakes up, it sends its delayed metadata
commit. If the Batch Coordinator accepts it, we've just merged stale data
into a transaction that has already been marked as aborted or committed: a
direct EOS violation.

Proposed Design: Probabilistic timeouts won't fix this; we need
deterministic correctness. Every metadata commit should include a monotonic
BrokerEpoch acting as a fencing token. The Batch Coordinator must validate
this token against the latest known cluster state and immediately reject
anything from a stale epoch.
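In code, the fencing check is a one-line comparison; the scaffolding around
it below (class and method names, the epoch-bump feed) is purely
illustrative and assumes the coordinator learns of epoch bumps through the
same metadata stream:

```python
class FencingValidator:
    """Sketch: the Batch Coordinator fences stale brokers by epoch."""

    def __init__(self):
        self.latest_epoch = {}   # broker_id -> latest known BrokerEpoch

    def observe_epoch_bump(self, broker_id, epoch):
        # Epochs are monotonic; never regress on out-of-order observations.
        self.latest_epoch[broker_id] = max(
            self.latest_epoch.get(broker_id, -1), epoch)

    def accept_commit(self, broker_id, broker_epoch):
        # A zombie broker resuming after a GC pause presents a stale epoch
        # and is rejected deterministically, regardless of wall-clock timing.
        return broker_epoch >= self.latest_epoch.get(broker_id, -1)
```

Because the decision depends only on the token ordering and never on
elapsed time, correctness holds even under arbitrarily long pauses, which
is exactly what probabilistic timeouts cannot guarantee.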

Locking down these public interfaces and state transitions in the text will
give the community the confidence needed to implement this safely.

Happy to dig into the code or discuss further if it helps clarify any of
this. Looking forward to hearing your thoughts on how we handle these
synchronization barriers.


Regards,
Viquar Khan
LinkedIn - https://www.linkedin.com/in/vaquar-khan-b695577/
Book - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
GitBook - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
Stack Overflow - https://stackoverflow.com/users/4812170/vaquar-khan
GitHub - https://github.com/vaquarkhan
