https://github.com/apache/pulsar/pull/25693
------
# PIP-471: Metadata-Driven Transactions for Scalable Topics
*Sub-PIP of [PIP-460: Scalable Topics](pip-460.md)*
## Background
### Pulsar's existing transaction model
Pulsar transactions today are realized through three components:
- **Transaction Coordinator (TC)** — a per-broker service backed by a
system topic (`__transaction_log_*` in `pulsar/system`) that tracks
the lifecycle of every transaction (`OPEN`, `COMMITTING`, `COMMITTED`,
`ABORTING`, `ABORTED`, `TIME_OUT`) and orchestrates two-phase commit
across the topics that participate in each transaction.
- **TransactionBuffer (TB)** — a per-`PersistentTopic` component that
buffers transactional writes in the topic's data stream, tracks
aborted transaction IDs, and gates the dispatcher's read horizon
(`maxReadPosition`) so that uncommitted entries are not delivered. The
TB persists its state in a per-namespace system topic
(`__transaction_buffer_snapshot`).
- **PendingAckStore** — a per-(topic, subscription) component that
records transactional acknowledgments in a sibling persistent topic
(`<topic>-<sub>__transaction_pending_ack`), applying them to the
cursor only when the transaction commits.
When a transaction ends, the TC sends `END_TXN_ON_PARTITION` (and
`END_TXN_ON_SUBSCRIPTION` for acks) to every participant. The TB then
writes a **commit or abort marker** as a regular entry in the topic's
managed ledger. The dispatcher discovers committed/aborted state by
replaying these markers and consulting the in-memory aborted-txn set.
### Scalable topics
[PIP-460](pip-460.md) introduces scalable topics: a logical topic
backed by a DAG of range segments (`segment://...`) that can be split
or merged at runtime. Each segment is a regular `PersistentTopic` from
the broker's perspective, but the segment's lifetime is controlled by
the [scalable topic controller](pip-468.md) — segments get **sealed**
when split or merged, after which the segment's managed ledger no
longer accepts writes.
### How the two interact
The current transaction implementation composes per-`PersistentTopic`.
With scalable topics, every segment carries its own TB. This
composition fails in two ways:
1. **End-of-transaction stalls on sealed segments.** The TC sends
`END_TXN_ON_PARTITION` to each segment that received writes. The
segment's TB tries to append a commit/abort marker — which is a write
— and the now-sealed segment rejects it. The end-txn RPC times out
(~30s).
2. **Pending-ack topic naming collides with the segment-domain
parser.** The convention `<topic>-<sub>__transaction_pending_ack` is
unparseable when `<topic>` is a `segment://...` URI. (Worked around in
#25631 with a flat persistent name; see "Out of Scope" below.)
The first issue is structural, not just a routing bug. As long as
commit/abort decisions need to be persisted **inside the topic's data
stream**, sealing the topic terminates any in-flight transaction.
---
## Motivation
We need transactions that:
1. Provide atomicity across multiple writes and acknowledgments,
possibly spanning multiple topics across multiple namespaces.
2. Compose correctly with the scalable-topic lifecycle — including
splits, merges, and segments sealed mid-transaction.
3. Do not require duplicating data (each `producer.send` produces a
single managed-ledger append).
4. Reuse as much of the existing transaction surface as possible —
interfaces, dispatcher integration, client API — so that we are not
re-litigating well-understood concerns.
5. Coexist with v4 transactions on `persistent://` topics with no
behavior change for those topics.
The structural mismatch between in-stream markers and a mutable
segment DAG cannot be papered over at the routing or the topic-naming
layer. It needs a transaction representation that does not put the
decision record inside the data stream.
---
## Goals
### In Scope
- Atomic transactions over `segment://` topics (writes and acks),
including transactions whose lifetime spans split/merge.
- Multi-topic, multi-namespace, multi-segment transactions with the
same atomicity guarantees as today.
- Reuse of the existing `Transaction`, `TransactionCoordinator`,
`TransactionBuffer`, `PendingAckStore`, dispatcher, and client APIs.
New behavior arrives as alternative implementations behind the
existing interfaces.
- Coexistence with the legacy in-stream-marker implementation for
`persistent://` topics.
### Out of Scope
- Replacing the legacy implementation for non-scalable topics. The new
implementation is opt-in per topic; `persistent://` topics keep their
current behavior, including the existing TC.
- Replacing the segment-aware pending-ack topic name introduced in
#25631 — that workaround becomes unnecessary as a side effect of this
PIP and is removed in the same change.
- Cross-cluster (geo-replicated) transactional semantics.
---
## High Level Design
The proposal is one sentence:
> **Move transactional state out of the data stream and into the metadata
> store.**
Concretely: keep all existing components and interfaces, and add a
parallel implementation of `TransactionBuffer`, `PendingAckStore`,
**and Transaction Coordinator** that writes nothing to any data
stream. Their state lives entirely in the metadata store. The legacy
in-stream-marker components remain, unchanged, for `persistent://`
topics; the new metadata-driven components handle `segment://` topics.
The dispatcher's contract is unchanged.
Why introduce a v5 TC rather than reuse the legacy one: the legacy TC
stores its log in a system topic (`__transaction_log_*`), which
carries the operational concerns of any system topic — compaction can
lead to long recovery times, leadership has to be maintained, and
recovery is on the data path. With the metadata store available we can
have a TC whose state is just a few key-value records, no log, no
system topic, no per-broker in-memory replay. Running both TC
implementations in parallel keeps v4 transactions byte-for-byte
unchanged while the v5 path uses the simpler design.
### Why this works for scalable topics
- **Sealing a segment is irrelevant.** Commit/abort no longer require
any append to the segment. End-txn becomes a metadata-store CAS on a
single record. Sealed segments materialize the decision (advance
cursors, evict cache entries) without writing anything.
- **The dispatcher does not change.** It already asks the topic's TB
for `maxReadPosition` and `isTxnAborted`. We swap the source.
- **Splits/merges do not strand transactions.** Sealed parents and
live children both consult the same metadata; the decision lives above
the segments.
- **No data is duplicated.** Each transactional `send` produces
exactly one managed-ledger append, same as today.
### Architecture Overview
```
┌──────────────────────────────────────────────────────────────────┐
│  Client (V5)  --  producer.send(txn, ...)                        │
│               --  consumer.acknowledge(id, txn)                  │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
               ┌──────────────────┴──────────────────┐
               │                                     │
┌──────────────▼─────────────┐         ┌─────────────▼────────────────┐
│  Transaction Coordinator   │         │  Transaction Coordinator V5  │
│  (legacy, BK-backed log)   │         │  (metadata-store records)    │
│  → v4 / persistent:// txns │         │  → v5 / segment:// txns      │
└──────────────┬─────────────┘         └─────────────┬────────────────┘
               │                                     │
               │ END_TXN_ON_PARTITION /              │ header CAS,
               │ END_TXN_ON_SUBSCRIPTION (v4)        │ watches (v5)
               ▼                                     ▼
┌────────────────────────────────────────────────────────────────────┐
│                    Per-topic broker components                     │
│                                                                    │
│  ┌────────────────────────────┐    ┌────────────────────────────┐  │
│  │  TopicTransactionBuffer    │    │  MLPendingAckStore         │  │
│  │  (in-stream markers)       │    │  (sibling topic)           │  │
│  │  → persistent:// topics    │    │  → persistent:// topics    │  │
│  └────────────────────────────┘    └────────────────────────────┘  │
│                                                                    │
│  ┌────────────────────────────┐    ┌────────────────────────────┐  │
│  │ MetadataTransactionBuffer  │    │  MetadataPendingAckStore   │  │
│  │ (metadata-store records)   │    │  (metadata-store records)  │  │
│  │  → segment:// topics       │    │  → segment:// topics       │  │
│  └─────────┬──────────────────┘    └──────────┬─────────────────┘  │
└────────────┼───────────────────────────────────┼───────────────────┘
             │                                   │
             ▼                                   ▼
    Metadata Store — txn coordinator state + txn-op records +
                     secondary indexes
```
The `TransactionBufferProvider` and
`TransactionPendingAckStoreProvider` SPIs already exist. The new TB /
PendingAckStore implementations slot in behind them. The v5 TC is a
parallel coordinator selected by the client when it is configured for
the new path. Selection on the participant side is per-topic, based on
the topic's domain.
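To make the wiring concrete, a minimal sketch of a domain-dispatching provider follows. `TransactionBufferProvider`, `TopicTransactionBuffer`, and `Topic` are the existing broker classes; `MetadataTransactionBuffer` is the implementation proposed here, and the string-prefix dispatch is purely illustrative:
```java
import org.apache.pulsar.broker.service.Topic;
import org.apache.pulsar.broker.service.persistent.PersistentTopic;
import org.apache.pulsar.broker.transaction.buffer.TransactionBuffer;
import org.apache.pulsar.broker.transaction.buffer.TransactionBufferProvider;
import org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer;

// Sketch only: per-topic selection of the TB implementation by domain.
public class DomainDispatchingTransactionBufferProvider
        implements TransactionBufferProvider {

    @Override
    public TransactionBuffer newTransactionBuffer(Topic originTopic) {
        if (originTopic.getName().startsWith("segment://")) {
            // v5 path: all txn state in the metadata store, no in-stream markers
            return new MetadataTransactionBuffer((PersistentTopic) originTopic);
        }
        // legacy path: in-stream commit/abort markers, unchanged
        return new TopicTransactionBuffer((PersistentTopic) originTopic);
    }
}
```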
---
## Detailed Design
### Data Model
The metadata store holds two classes of records and four secondary
indexes. All records for a given transaction share the same
**partition key** (`txnId`) so they are co-located — this makes
per-txn scans (e.g. listing all ops to apply at end-txn time) a
single-partition operation rather than a fan-out.
> **A note on metadata-store backends.** The design is
> `MetadataStore`-agnostic. It depends on three capabilities — partition-key
> co-location, sequential keys, and secondary indexes with range queries and
> range-watch — that the `MetadataStore` interface does not expose today. We
> extend the interface to surface them; backends that natively support these
> (notably Oxia, the intended default) implement them directly, while backends
> that don't (e.g. ZooKeeper) can implement them in a less efficient way
> (client-side counters for sequential IDs; client-maintained index records;
> periodic re-list in lieu of range-watch). Correctness does not depend on
> backend choice; throughput and recovery latency may.
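To pin down what is being assumed, here is a hypothetical sketch of the extended surface. None of these methods exist on `MetadataStore` today; the names, signatures, and index-by-name convention are illustrative only:
```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import org.apache.pulsar.metadata.api.GetResult;
import org.apache.pulsar.metadata.api.MetadataStore;
import org.apache.pulsar.metadata.api.Notification;

// Hypothetical extension surface; not part of MetadataStore today.
public interface ExtendedMetadataStore extends MetadataStore {

    // Sequential keys: the backend (or, on backends without native
    // support, a counter inside the MetadataStore implementation)
    // assigns the <seq> suffix under the given prefix.
    CompletableFuture<String> putSequential(String prefix, byte[] value,
                                            String partitionKey);

    // Secondary-index range query, e.g. "all op records whose
    // segment == S" against idx:writes-by-segment.
    CompletableFuture<List<GetResult>> rangeQuery(String indexName,
                                                  String fromKey,
                                                  String toKey);

    // Range-watch: deliver the current entries in [fromKey, toKey),
    // then keep notifying as entries appear, change, or disappear.
    void rangeWatch(String indexName, String fromKey, String toKey,
                    Consumer<Notification> listener);
}
```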
#### Header — one per transaction. Linearization point.
```
/txn/<txnId>                 partitionKey = txnId
    = {
        state:        OPEN | COMMITTED | ABORTED,
        timeout_ms:   <abs epoch ms>,
        created_ms:   <abs epoch ms>,
        finalized_ms: <abs epoch ms, set by the end-txn CAS>
      }
```
State transitions are conditional puts (CAS on version) issued by the
v5 TC. `OPEN → COMMITTED` and `OPEN → ABORTED` are the only allowed
transitions; `COMMITTED` and `ABORTED` are terminal.
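For concreteness, a sketch of the `OPEN → COMMITTED` transition on top of the existing `MetadataStore.put(path, value, expectedVersion)` conditional put. `TxnHeader` and `TxnState` are illustrative stand-ins for the header codec:
```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import org.apache.pulsar.metadata.api.GetResult;
import org.apache.pulsar.metadata.api.MetadataStore;

CompletableFuture<Void> commitTxn(MetadataStore store, String txnId) {
    String path = "/txn/" + txnId;
    return store.get(path).thenCompose(optRes -> {
        GetResult res = optRes.orElseThrow(
                () -> new IllegalStateException("unknown txn: " + txnId));
        TxnHeader header = TxnHeader.parse(res.getValue()); // illustrative codec
        if (header.state != TxnState.OPEN) {
            // terminal states are immutable; a second endTxn is rejected here
            throw new IllegalStateException("txn not OPEN: " + header.state);
        }
        header.state = TxnState.COMMITTED;
        header.finalizedMs = System.currentTimeMillis();
        // CAS on the record version: if a concurrent endTxn or the timeout
        // sweep finalized the header first, this fails with BadVersion.
        return store.put(path, header.serialize(),
                         Optional.of(res.getStat().getVersion()));
    }).thenApply(stat -> null);
}
```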
#### Operation records — one per transactional write or ack. Unbounded.
```
/txn-op/<txnId>/<seq>        partitionKey = txnId,
                             sequential = true    # server-assigned <seq>
    = {
        kind:         "write" | "ack",
        segment:      "segment://t/n/x/<descriptor>",   # always present
        subscription: "<sub-fqn>",                      # ack only
        position:     <ledgerId>:<entryId>
      }
```
Each operation is its own record, so a transaction has no size limit
and concurrent participants do not contend on a single record. With
**sequential keys** the server (or, on backends that lack them, a
`MetadataStore`-side counter) assigns `<seq>`, eliminating client-side
collisions.
#### Secondary indexes (auto-maintained by the metadata store)
```
idx:writes-by-segment              on /txn-op/*   where kind=write
    key = segment
    → range query "writes touching segment S"

idx:acks-by-segment-subscription   on /txn-op/*   where kind=ack
    key = (segment, subscription)
    → range query "acks on (segment S, subscription SU)"

idx:txn-by-deadline                on /txn/*      where state=OPEN
    key = timeout_ms
    → range query "open txns past deadline"
    → used by the TC for timeout-driven abort

idx:txn-by-final-state             on /txn/*      where state ∈ {COMMITTED, ABORTED}
    key = (state, finalized_ms)
    → range query "finalized txns ready for GC"
    → used by the GC sweep to find finalized txns whose op records can be deleted
```
#### Garbage collection
A finalized transaction (`COMMITTED` or `ABORTED`) is removed in two phases:
1. **Per-participant materialization.** When the TC fans out end-txn,
each participant broker materializes the decision (commit: advance
subscription cursors for acks, evict header cache; abort: drop ops).
Once a participant has finished its materialization for `<txnId>`, it
deletes its op records (`/txn-op/<txnId>/<seq>` for ops it owns).
2. **Header GC sweep.** A periodic sweep scans
`idx:txn-by-final-state` for entries past a configurable retention
window (e.g. 60 s after `finalized_ms`). For each, it verifies no
`/txn-op/<txnId>/*` records remain (orphan check from a participant
crash), forces deletion of any leftovers, and finally deletes the
header `/txn/<txnId>`.
Because all of a txn's records share the same partition (`partitionKey
= txnId`), the GC sweep's per-txn cleanup stays in one partition: list
`/txn-op/<txnId>/`, delete, then delete the header.
Indexes update transactionally with the underlying records, so they self-clean.
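A sketch of the sweep under the same assumptions: `rangeQuery` comes from the hypothetical extension above, `getChildren` / `delete` are existing `MetadataStore` calls, and `finalStateKey` / `txnIdOf` are illustrative helpers:
```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

void gcSweep(ExtendedMetadataStore store, long retentionMs) {
    long cutoff = System.currentTimeMillis() - retentionMs;
    // Finalized txns whose finalized_ms is past the retention window
    store.rangeQuery("idx:txn-by-final-state", "", finalStateKey(cutoff))
         .thenAccept(headers -> headers.forEach(h -> {
             String txnId = txnIdOf(h); // illustrative helper
             // Orphan check: a participant may have crashed before
             // deleting the op records it owned.
             store.getChildren("/txn-op/" + txnId)
                  .thenCompose(orphans -> CompletableFuture.allOf(
                          orphans.stream()
                                 .map(seq -> store.delete(
                                         "/txn-op/" + txnId + "/" + seq,
                                         Optional.empty()))
                                 .toArray(CompletableFuture[]::new)))
                  // Only once the op prefix is empty does the header go.
                  .thenCompose(v -> store.delete("/txn/" + txnId,
                                                 Optional.empty()));
         }));
}
```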
### Components
#### `MetadataTransactionBuffer` (new)
Implements the existing `TransactionBuffer` interface. Used for
`segment://` topics.
| Method | Behavior |
|---|---|
| `appendBufferToTxn(txnId, buf)` | `ML.asyncAddEntry(buf)`; on success, append a sequential `/txn-op/<txnId>/<seq>` (`partitionKey=txnId`) with `kind="write", segment, position`. The publish ack waits for both. |
| `commit(txnId, position)` / `abort(...)` | Not invoked by the v5 TC (which does not RPC participants). The TB's header watch fires when `/txn/<txnId>.state` changes; the TB then materializes locally (evict / mark-aborted) and deletes its owned op records. |
| `getMaxReadPosition()` | Served from an in-memory cache populated by a watch on `idx:writes-by-segment == <my-segment>`, joined against the header cache. Result: `min(position over OPEN txns) - 1`, capped at the LAC. |
| `isTxnAborted(msg)` | Look up `/txn/<txnId>.state` from the header cache. |
| `recover()` | Open the index watch and the header cache; populate from the current snapshot. No log replay, no snapshot topic. |
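A sketch of the two hot-path methods. `putSequential` is the hypothetical extension from the data-model section; `headerCache`, `openTxnPositions`, `opRecord`, `previousPosition`, and the `addEntry` wrapper over the callback-based `ManagedLedger.asyncAddEntry` are illustrative:
```java
// Publish path: one managed-ledger append plus one op record; the
// publish ack completes only after both have succeeded.
public CompletableFuture<Position> appendBufferToTxn(
        TxnID txnId, long sequenceId, ByteBuf buffer) {
    TxnHeader header = headerCache.get(txnId); // watch-maintained cache
    if (header == null || header.state != TxnState.OPEN) {
        // surfaces to the client as TxnConflict (see "Late-write race")
        return CompletableFuture.failedFuture(
                new IllegalStateException("txn not OPEN: " + txnId));
    }
    return addEntry(buffer) // illustrative wrapper over asyncAddEntry
            .thenCompose(pos -> store.putSequential(
                            "/txn-op/" + txnId,
                            opRecord("write", segmentName, pos),
                            txnId.toString())
                    .thenApply(seqKey -> pos));
}

// Dispatch gate: min(position over OPEN txns) - 1, capped at the LAC.
public PositionImpl getMaxReadPosition() {
    return openTxnPositions.values().stream() // fed by the index watch
            .min(PositionImpl::compareTo)
            .map(this::previousPosition)      // one before the earliest open txn
            .orElse(lastConfirmedEntry);      // no open txns: read up to the LAC
}
```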
#### `MetadataPendingAckStore` (new)
Implements the existing `PendingAckStore` interface. Used for
`segment://` topic subscriptions.
| Method | Behavior |
|---|---|
| `appendIndividualAck(txnId, positions)` | Append sequential `/txn-op/<txnId>/<seq>` records with `kind="ack", segment, subscription, position`. |
| `appendCumulativeAck(...)` | Same shape; a single op record carrying the cumulative position. |
| `commit(txnId)` / `abort(txnId)` | Not invoked by the v5 TC; triggered locally when the header watch on `/txn/<txnId>.state` fires. Commit: range-query `idx:acks-by-segment-subscription == (<my-segment>, <my-subscription>)` filtered to `<txnId>`; apply to the cursor (`markDelete` or `individualAck`); range-delete the op records. Abort: range-delete the op records; no cursor work. |
| `replayAsync()` (recovery) | Range-query `idx:acks-by-segment-subscription == (<my-segment>, <my-subscription>)`, group by `txnId`, hydrate in-memory state. |
#### Transaction Coordinator V5 (new)
A parallel coordinator selected by the v5 client. Same client-facing
wire commands (`NEW_TXN`, `ADD_PARTITION_TO_TXN`,
`ADD_SUBSCRIPTION_TO_TXN`, `END_TXN`), but no system-topic log: every
operation reads or CAS's a metadata-store record. **The TC does not
RPC participants** — see "Notification mechanism" below.
| Operation | Behavior |
|---|---|
| `newTxn(timeoutMs)` | Create `/txn/<txnId>` with `state=OPEN`, `timeout_ms = now + timeoutMs`. |
| `addPartitionToTxn` / `addSubscriptionToTxn` | No-op at the coordinator. The participant broker writes its own op records when the actual write/ack arrives; the TC never needs to enumerate participants. |
| `endTxn(COMMIT\|ABORT)` | A single CAS on `/txn/<txnId>` that flips `state` and sets `finalized_ms`. After it returns, the TC acks the client. No fan-out, no waiting on participants. |
| Timeout sweep | Range-query `idx:txn-by-deadline` for entries with `timeout_ms ≤ now`; abort each (same single-CAS flow). |
| GC sweep | Range-query `idx:txn-by-final-state` for entries past retention; for each, verify `/txn-op/<txnId>/*` is empty (force-delete leftovers); delete the header. |
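The timeout sweep reduces to an index scan plus the same single header CAS sketched earlier; `deadlineKey`, `txnIdOf`, and `abortTxn` (the abort twin of the `commitTxn` sketch) are illustrative:
```java
void timeoutSweep(ExtendedMetadataStore store) {
    // All OPEN txns whose timeout_ms is already in the past
    store.rangeQuery("idx:txn-by-deadline", "",
                     deadlineKey(System.currentTimeMillis()))
         .thenAccept(expired -> expired.forEach(h ->
                 // Same single-CAS flow as a client-driven abort. If a
                 // concurrent commit wins the race, this CAS fails with
                 // BadVersion and the abort is simply dropped.
                 abortTxn(store, txnIdOf(h))));
}
```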
Why a parallel coordinator rather than reusing the legacy TC (the short
version appears in the high-level design): the legacy TC's per-shard
system topic (`__transaction_log_*`) requires leadership election, runs
compaction over its own log, and pays a recovery cost on every broker
restart proportional to the live transaction set. The v5 TC's state is
just per-txn KV records — there is no log to compact and no cold-start
replay. A v5 client routes its `NEW_TXN` to the v5 TC; v4 clients route
to the legacy TC. A single transaction never spans the two.
#### Notification mechanism (TC → participants)
The legacy TC needs to RPC each participant (`END_TXN_ON_PARTITION`,
`END_TXN_ON_SUBSCRIPTION`) because the participants have no other way
to learn the decision — the TC's log is the only source of truth, and
only the TC reads it.
In the v5 design **the metadata store is the source of truth**, and
every participant already reads from it. Participants therefore learn
about state transitions directly from the store, without any
TC-to-broker RPC:
- A `MetadataTransactionBuffer` keeps an in-memory header cache for
txns it has writes from. The cache entries are populated when a write
op record is appended (the broker reads the header to authorize the
write) and **kept up to date by point-watches on the headers it has
cached**.
- A `MetadataPendingAckStore` maintains the same pattern for txns it
has acks from.
- When the TC CAS's `/txn/<txnId>.state` from OPEN →
COMMITTED/ABORTED, every cached watcher fires. Each participant
materializes locally:
- **Commit** — TB evicts its cache entry (the txn no longer pins
`maxReadPosition` back); PendingAckStore applies the buffered acks to
the cursor.
- **Abort** — TB marks the txn aborted in its cache (the
dispatcher's `isTxnAborted` will skip those entries); PendingAckStore
drops the buffered acks.
- After materialization, the participant deletes the op records it
owns (`/txn-op/<txnId>/<seq>` for ops on its segment / subscription).
- The TC's GC sweep (above) detects when all participants have done
their cleanup — the prefix `/txn-op/<txnId>/*` is empty — and deletes
the header.
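A sketch of the participant-side handler that replaces those RPCs, with the existing `MetadataStore.registerListener` standing in for the per-header point watch; `materializeCommit` / `materializeAbort` / `deleteOwnedOpRecords` are the local steps described in the list above:
```java
store.registerListener(notification -> {
    String path = notification.getPath();
    if (!path.startsWith("/txn/")) {
        return;
    }
    String txnId = path.substring("/txn/".length());
    if (!headerCache.containsKey(txnId)) {
        return; // not a txn this participant has writes or acks from
    }
    store.get(path).thenAccept(opt -> opt.ifPresent(res -> {
        TxnHeader header = TxnHeader.parse(res.getValue()); // illustrative codec
        switch (header.state) {
            case COMMITTED -> materializeCommit(txnId); // cursor advance, evict
            case ABORTED   -> materializeAbort(txnId);  // mark aborted, drop acks
            default        -> { return; }               // still OPEN: nothing to do
        }
        deleteOwnedOpRecords(txnId); // idempotent; re-runs safely after a crash
    }));
});
```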
Consequences:
- **End-txn latency.** From the client's perspective, `commit` returns
as soon as the header CAS lands. From a consumer's perspective,
freshly-committed entries become visible only after the participant's
header watch fires and materialization runs, typically within tens of
milliseconds and bounded by metadata-store watch propagation. (If we ever
care about a tighter bound — e.g. for a given workload — the TC can
issue an optional `nudge` RPC to participants in parallel with the
CAS. Not needed for correctness; not in this PIP.)
- **No RPC fan-out from TC.** End-txn is `O(1)` work at the TC: one
CAS. The cost of fan-out is paid by the metadata store's
watch-delivery infrastructure, which already exists for other Pulsar
uses.
- **Crash idempotence.** A participant that crashes during
materialization restarts, observes the (already-final) header state
via its watch, and finishes materialization. The TC need not retry
anything.
#### Dispatcher
Unchanged. It already asks
`topic.getTransactionBuffer().getMaxReadPosition()` and
`topic.getTransactionBuffer().isTxnAborted(...)`. The new TB
implements both.
### Flows
#### Publish (transactional)
```mermaid
sequenceDiagram
participant C as Client
participant B as Segment broker
participant ML as Managed Ledger
participant M as Metadata Store
C->>B: send(txnId, payload)
B->>M: read /txn/<txnId>.state (cached)
alt state != OPEN
B-->>C: TxnConflict
else state == OPEN
B->>ML: asyncAddEntry(payload)
ML-->>B: position
B->>M: put /txn-op/<txnId>/<seq> {kind=write, segment, position}
M-->>B: ack
B-->>C: send-ack
end
```
The header read is cache-first; the cache is invalidated by the same
watch the TB already maintains on the header. The op-record put is the
only synchronous metadata-store write on the publish path.
#### End-txn (commit or abort)
```mermaid
sequenceDiagram
participant Cl as Client
participant TC as Transaction Coordinator V5
participant M as Metadata Store
participant P as Participant brokers
Cl->>TC: commit(txnId)
TC->>M: CAS /txn/<txnId>.state OPEN→COMMITTED, set finalized_ms
M-->>TC: ack
TC-->>Cl: ack
Note over M,P: Independently, asynchronously:
M-->>P: header watch fires
P->>P: materialize (cursor advance / cache evict)
P->>M: delete owned /txn-op/<txnId>/<seq> records
Note over TC,M: Later, GC sweep:
TC->>M: list /txn-op/<txnId>/* (empty? then delete header)
```
The CAS on the header is the linearization point — that is when the
transaction's outcome is decided. Notification of participants is not
part of the linearization; it propagates via the watches every
participant already maintains on the headers it has cached. Sealed
segments are fine — materialization is metadata + cursor work, no
managed-ledger writes.
#### Subscribe / dispatch
Unchanged. The dispatcher polls `tb.getMaxReadPosition()` and filters
by `tb.isTxnAborted(msg)`. The `MetadataTransactionBuffer` answers
both from its in-memory caches, fed by metadata-store watches.
#### Late-write race
The TC is mid-end-txn when the client publishes once more inside the
same transaction. The header CAS may have already flipped to
`COMMITTED`/`ABORTED`; the publish-path header check on the participant
broker then rejects the write with `TxnConflict`. This mirrors today's
behavior (the legacy TC moves a transaction to `COMMITTING`/`ABORTING`
and brokers reject new writes for it); the only difference is that the
rejection criterion is now read from the metadata store rather than
learned via a TC RPC.
### Recovery
- **Broker startup.** Each `MetadataTransactionBuffer` opens its index
watch and header cache. The first watch event delivers the snapshot;
the TB is ready as soon as the snapshot has been applied. No log
replay, no system-topic reader, no snapshot topic.
- **Broker crash mid-publish.** If the broker appended the entry but
crashed before writing the op record, the entry exists in the segment
but no metadata claims it. On txn timeout the TC aborts the txn; the
dispatcher's `isTxnAborted` check (which falls back to "abort" for
unknown txnIds at retention horizon) discards the entry.
- **Broker crash mid-end-txn.** If the header CAS landed but
materialization on a participant did not complete, the participant
re-derives state from the header on restart and finishes
materialization. End-txn is idempotent.
- **TC failover.** The v5 TC has no in-memory log to replay — its
state lives in the metadata store. Whichever broker takes over
coordinator duty for a TC partition resumes operations directly from
the metadata-store records. Cold-start cost is bounded by
`idx:txn-by-deadline` and `idx:txn-by-final-state` scans, not by
replay of an entire transaction log.
### Concurrency and contention
- Each transactional publish writes a unique `/txn-op/<txnId>/<seq>`
record (server-assigned sequential key). There is no contention
between concurrent participants of the same transaction.
- The header is CAS'd at most twice per transaction lifetime (open +
finalize), so contention there is bounded.
- All records for a given txn share `partitionKey=txnId`, so per-txn
operations (list, range-delete) stay on a single partition.
- Index updates are managed by the metadata store; their scaling is
the store's concern.
---
## Public-facing Changes
### Public API
No changes. The client-facing `Transaction` API is unchanged.
### Binary protocol
No changes to client-facing wire commands (`NEW_TXN`,
`ADD_PARTITION_TO_TXN`, `ADD_SUBSCRIPTION_TO_TXN`, `END_TXN`) — the v5
TC accepts them with the same semantics as the legacy TC.
The broker-to-broker commands `END_TXN_ON_PARTITION` and
`END_TXN_ON_SUBSCRIPTION` are **not used** by the v5 path: participant
brokers learn about the decision by watching the metadata-store header
rather than by receiving an RPC from the TC. The legacy TC still uses
these commands for v4 transactions; they remain in the protocol
unchanged.
### Configuration
A per-namespace or per-broker setting selects the TB implementation.
Default for `segment://` topics: metadata-driven. Default for
`persistent://` topics: in-stream markers (unchanged). Override is
possible per-namespace for debugging / migration.
### Metrics
Existing transaction metrics remain. The metadata-driven implementation adds:
- `pulsar_txn_metadata_store_op_writes_total` (counter) — op records written.
- `pulsar_txn_metadata_store_header_cas_total{result="ok|conflict|reject"}`
(counter) — header CAS attempts and outcomes.
- `pulsar_txn_metadata_store_index_query_seconds` (histogram) —
latency of the index range queries on `idx:writes-by-segment` /
`idx:acks-by-segment-subscription`.
- `pulsar_txn_metadata_store_outstanding_op_records` (gauge) —
uncollected op records (a proxy for txn GC backlog).
Existing `pulsar_txn_tb_*` snapshot/replay metrics are not emitted by
the new implementation (no snapshots, no replay).
---
## Backward & Forward Compatibility
### Upgrade
- Existing `persistent://` topic behavior is unchanged. v4 clients see
no difference.
- Brokers running this PIP can interoperate with brokers that do not,
as long as a given **topic** is consistently served by brokers of one
kind. Since topic ownership is bundle-based, a topic is served by
exactly one broker at a time, and after a load-balancer transfer the
new owner rebuilds its TB state from the metadata store, so this holds
automatically.
- Per-segment pending-ack topics created by the workaround in #25631
(`persistent://t/n/<localName>-<descriptor>-<sub>__transaction_pending_ack`)
are no longer used. They are deleted as part of upgrade. Since the
workaround was only ever exercised by V5 transactional consumer flows,
the upgrade path is safe.
### Downgrade / Rollback
Not applicable. Scalable topics are introduced as a new feature in
Pulsar 5.0 ([PIP-460](pip-460.md)); this PIP defines transactional
support for that feature from the start. There is no prior version to
roll back to.
### Pulsar Geo-Replication
Out of scope. Transactional geo-replication is not supported in either model.
---
## Alternatives Considered
### A. Move TB to the scalable-topic level (one TB per logical topic)
An earlier draft of this design took this approach. It is
architecturally clean — decisions live above segments — but it
introduces a new broker-side singleton per scalable topic, adds new
failover semantics, and complicates the TC's wire protocol (end-txn
would need redirection from the segment to the scalable-topic owner).
Replacing the per-topic TB **implementation** with a metadata-driven
one achieves the same correctness without any of that surface area.
### B. Per-segment TB but using an off-segment marker stream
Keep the per-segment TB; have it write commit/abort markers to a
**separate** managed ledger (e.g. a shadow topic) rather than into the
segment's own data. Sealed segments would no longer block end-txn.
Rejected because: (1) it doubles the data path (every txn needs a
write to the segment **and** to the shadow topic), (2) it requires a
new system-topic-per-segment, and (3) it does not eliminate the
snapshot/replay machinery that the metadata-driven approach removes
outright.
### C. Skip transactional support on scalable topics
Document scalable topics as non-transactional. Rejected: the
transactional consume-and-produce pattern is a primary use case for
scalable streaming workloads (Kafka Streams analogue), and PIP-460's
roadmap explicitly calls out transactions across range segments as a
Phase 4 deliverable.
---
## General Notes
The shape of the change is *one new `TransactionBuffer`
implementation, one new `PendingAckStore` implementation, one new
`TransactionCoordinator` implementation, and the `MetadataStore`
extensions to support partition-key co-location, sequential keys, and
secondary indexes with range-watch*. The complexity is in the
interaction of the metadata schema with the dispatcher's existing
assumptions, not in any new system component on the broker.
## Links
- [PIP-460: Scalable Topics](pip-460.md)
- [PIP-468: Scalable Topic Controller](pip-468.md)
--
Matteo Merli
<[email protected]>