Thanks Josep, let me move my questions and proposed design into the proper thread.
Regards,
Viquar Khan

On Mon, 2 Mar 2026 at 01:59, Josep Prat <[email protected]> wrote:

> Hi Viquar,
>
> Thanks for your comments and for participating in the KIP process. In
> order for your comments to be registered properly, you have to use the
> proper DISCUSS threads for each KIP. This way, we have a singular
> centralized archive for discussions and votes.
> For KIP-1164, you can find the existing DISCUSS thread here [1]. The
> detailed process for Kafka Improvement Proposals is also available for
> reference [2].
>
> [1] https://lists.apache.org/thread/m9l6lbqv2cffxtz5frypylmqjd7bsqoz
> [2] https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#KafkaImprovementProposals-Process
>
> Best,
>
> Josep Prat,
> PMC Member of Apache Kafka.
>
> On 2026/03/01 20:26:24 vaquar khan wrote:
> > Hi Everyone,
> >
> > Following up on the KIP-1150 vote thread, I'm moving my questions
> > regarding Exactly-Once Semantics (EOS) over here since they fall
> > squarely into KIP-1164's domain.
> >
> > Decoupling storage is a massive win for cross-AZ costs, but shifting
> > to a leaderless data plane inherently decentralizes the transaction
> > state machine. To ensure we don't accidentally introduce split-brain
> > scenarios or break read_committed isolation, we need to explicitly
> > define the synchronization barriers.
> >
> > Here are three areas where the current design needs tighter specs,
> > along with some proposed architectural patterns to solve them:
> >
> > 1. LSO Calculation via Materialized Views: In standard Kafka, the
> > partition leader is the single source of truth. It tracks in-flight
> > transactions via the ProducerStateManager and computes the Last
> > Stable Offset (LSO) in memory. With diskless, the Batch Coordinator
> > takes over this role.
> >
> > The Gap: If the Batch Coordinator is handling LSO for a huge number
> > of multiplexed partitions, it risks becoming a severe bottleneck.
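[Editor's sketch of the projection idea proposed for point 1 below. All class and method names are assumed for illustration; nothing here is defined by the KIP. The coordinator replays the metadata event stream through the event handlers, and the LSO is then answered from the head of a small in-memory index rather than by scanning transaction logs.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical materialized view over the _diskless-metadata event stream.
// It tracks the first offset of each open transaction per partition, so the
// LSO is the earliest open transaction's first offset (or the high watermark
// when no transaction is open).
class LsoProjection {
    // producerId -> first offset of that producer's open transaction
    private final Map<Long, Long> openTxnFirstOffset = new HashMap<>();
    // first offset -> count of open transactions starting at that offset
    private final TreeMap<Long, Integer> openByOffset = new TreeMap<>();

    // Event: a producer's transaction first touched this partition at firstOffset.
    void onTxnStart(long producerId, long firstOffset) {
        if (openTxnFirstOffset.putIfAbsent(producerId, firstOffset) == null) {
            openByOffset.merge(firstOffset, 1, Integer::sum);
        }
    }

    // Event: a commit or abort marker was recorded for this producer.
    void onTxnEnd(long producerId) {
        Long first = openTxnFirstOffset.remove(producerId);
        if (first != null) {
            // A null remapping result removes the entry once the count hits zero.
            openByOffset.merge(first, -1, (a, b) -> a + b == 0 ? null : a + b);
        }
    }

    // LSO = earliest open transaction's first offset, else the high watermark.
    long lastStableOffset(long highWatermark) {
        return openByOffset.isEmpty() ? highWatermark : openByOffset.firstKey();
    }
}
```

Because the view is derived purely from replaying the event stream, it can be rebuilt from scratch on coordinator failover, which is what makes the event-store framing attractive here.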
> >
> > Proposed Design: I recommend we explicitly frame the
> > _diskless-metadata topic as an immutable Event Store. The
> > coordinator's embedded SQLite database should act purely as a
> > materialized view (projection) over this event stream. This
> > projection would maintain a continuously updated index of active
> > PIDs, allowing us to dynamically resolve the LSO in O(1) time
> > without requiring the coordinator to scan unbounded transaction
> > logs.
> >
> > 2. Cross-Coordinator RPC & The Commit Barrier: When the Transaction
> > Coordinator (TC) decides to commit, it needs to verify that all data
> > batches for that transaction epoch are actually in place and
> > sequenced.
> >
> > The Gap: The KIP currently lacks a defined RPC handshake between the
> > TC and the Batch Coordinator. What happens if a
> > CommitBatchCoordinates call is still in flight when the TC tries to
> > write the commit marker?
> >
> > Proposed Design: We need to explicitly document a strict "Commit
> > Barrier." Before writing the commit marker, the Batch Coordinator
> > must deterministically verify it has received contiguous sequence
> > numbers for the whole epoch. If there are pending asynchronous
> > payloads, the commit marker must be blocked at this barrier until
> > they resolve or definitively time out.
> >
> > 3. The Zombie Broker Problem & Fencing Tokens: This is the edge case
> > that worries me the most. Consider this scenario: a broker uploads a
> > batch to S3, but then gets hit with a severe GC pause before it can
> > send the metadata commit to the Batch Coordinator. Meanwhile, the
> > transaction times out and the TC rolls the epoch forward.
> >
> > The Gap: When the broker finally wakes up, it sends its delayed
> > metadata commit. If the Batch Coordinator accepts it, we've just
> > merged stale data into a transaction that's already been marked as
> > aborted or committed: a direct EOS violation.
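[Editor's sketch of the fencing check that would close point 3's gap. The KIP does not define this validation, so the class name, the per-broker epoch map, and the RPC shape are all assumptions; the point is only that the check is deterministic, not timeout-based.]

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical fencing logic at the Batch Coordinator. It remembers the
// latest broker epoch observed from cluster metadata, and a metadata commit
// is accepted only if it carries a current epoch. A zombie broker waking up
// after a GC pause still holds its old epoch, so its delayed commit is
// rejected rather than merged into a finished transaction.
class BatchCoordinatorFence {
    private final ConcurrentHashMap<Integer, Long> latestBrokerEpoch =
            new ConcurrentHashMap<>();

    // Cluster metadata update: the broker re-registered with a newer epoch.
    // Math::max keeps the epoch monotonic even if updates arrive out of order.
    void onBrokerEpochBump(int brokerId, long epoch) {
        latestBrokerEpoch.merge(brokerId, epoch, Math::max);
    }

    // Returns true iff the commit carries a current epoch; stale epochs
    // (and unknown brokers) are fenced.
    boolean tryCommit(int brokerId, long brokerEpoch) {
        Long current = latestBrokerEpoch.get(brokerId);
        return current != null && brokerEpoch >= current;
    }
}
```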
> >
> > Proposed Design: Probabilistic timeouts won't fix this; we need
> > deterministic correctness. Every metadata commit should include a
> > monotonic BrokerEpoch acting as a fencing token. The Batch
> > Coordinator must validate this token against the latest known
> > cluster state and immediately reject anything from a stale epoch.
> >
> > Locking down these public interfaces and state transitions in the
> > text will give the community the confidence needed to implement this
> > safely.
> >
> > Happy to dig into the code or discuss further if it helps clarify
> > any of this. Looking forward to hearing your thoughts on how we
> > handle these synchronization barriers.
> >
> > Regards,
> > Viquar Khan
> > *Linkedin* - https://www.linkedin.com/in/vaquar-khan-b695577/
> > *Book* - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
> > *GitBook* - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
> > *Stack* - https://stackoverflow.com/users/4812170/vaquar-khan
> > *github* - https://github.com/vaquarkhan
