lhotari commented on code in PR #25196:
URL: https://github.com/apache/pulsar/pull/25196#discussion_r2745453477
##########
pip/pip-454.md:
##########
@@ -0,0 +1,416 @@
+# PIP-454: Metadata Store Migration Framework
+
+## Motivation
+
+Apache Pulsar currently uses Apache ZooKeeper as its metadata store for broker
coordination, topic metadata, namespace policies, and BookKeeper ledger
management. While ZooKeeper has served well, there are several motivations for
enabling migration to alternative metadata stores:
+
+1. **Operational Simplicity**: Alternative metadata stores like Oxia may offer
simpler operations, better observability, or reduced operational overhead
compared to ZooKeeper ensembles.
+
+2. **Performance Characteristics**: Different metadata stores have different
performance profiles. Some workloads may benefit from stores optimized for high
throughput or low latency.
+
+3. **Deployment Flexibility**: Organizations may prefer metadata stores that
align better with their existing infrastructure and expertise.
+
+4. **Zero-Downtime Migration**: Operators need a safe, automated way to
migrate metadata between stores without service interruption.
+
+Currently, there is no supported path to migrate from one metadata store to
another without cluster downtime. This PIP proposes a **safe, simple migration
framework** that ensures metadata consistency by avoiding complex
dual-write/dual-read patterns. The framework enables:
+
+- **Zero-downtime migration** from any metadata store to any other supported
store
+- **Automatic ephemeral node recreation** in the target store
+- **Version preservation** to ensure conditional writes continue working
+- **Automatic failure recovery** if issues are detected
+- **Minimal configuration changes**: no config updates needed until after migration completes
+
+## Goal
+
+Provide a safe, automated framework for migrating Apache Pulsar's metadata
from one store implementation (e.g., ZooKeeper) to another (e.g., Oxia) with
zero service interruption.
+
+### In Scope
+
+- Migration framework supporting any source → any target metadata store
+- Automatic ephemeral node recreation by brokers and bookies
+- Persistent data copy with version preservation
+- CLI commands for migration control and monitoring
+- Automatic failure recovery during migration
+- Support for broker and bookie participation
+- Read-only mode during migration for consistency
+
+### Out of Scope
+
+- Developing new metadata store implementations (Oxia, Etcd support already
exists)
+- Cross-cluster metadata synchronization (different use case)
+- Automated rollback after COMPLETED phase (requires manual intervention)
+- Migration of configuration metadata store (can be done separately)
+
+## High Level Design
+
+The migration framework introduces a **DualMetadataStore** wrapper that
transparently handles migration without modifying existing metadata store
implementations.
+
+### Key Principles
+
+1. **Transparent Wrapping**: The `DualMetadataStore` wraps the existing source
store (e.g., `ZKMetadataStore`) without modifying its implementation.
+
+2. **Lazy Target Initialization**: The target store is only initialized when
migration begins, triggered by a flag in the source store.
+
+3. **Ephemeral-First Approach**: Before copying persistent data, all brokers
and bookies recreate their ephemeral nodes in the target store. This ensures
the cluster is "live" in both stores during migration.
+
+4. **Read-Only Mode During Migration**: To ensure consistency, all metadata
writes are blocked during PREPARATION and COPYING phases. Components receive
`SessionLost` events to defer non-critical operations (e.g., ledger rollovers).
+
+5. **Phase-Based Migration**: Migration proceeds through well-defined phases
(PREPARATION → COPYING → COMPLETED).
+
+6. **Generic Framework**: The framework is agnostic to specific store
implementations; it works with any source and target that implement the
`MetadataStore` interface.
+
+7. **Guaranteed Consistency**: By blocking writes during migration and using
atomic copy, metadata is **always in a consistent state**. No dual-write
complexity, no data divergence, no consistency issues.
+
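The wrapping-plus-read-only idea above can be sketched in a few lines. This is a minimal, self-contained illustration, not the actual Pulsar `MetadataStore` interface: `SimpleStore`, `InMemoryStore`, and `DualStore` are stand-in names, and the real implementation would work asynchronously with `CompletableFuture`s.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative stand-in for the real MetadataStore interface.
interface SimpleStore {
    Optional<byte[]> get(String path);
    void put(String path, byte[] value);
}

class InMemoryStore implements SimpleStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();
    public Optional<byte[]> get(String path) { return Optional.ofNullable(data.get(path)); }
    public void put(String path, byte[] value) { data.put(path, value); }
}

// Sketch of the DualMetadataStore idea: reads pass through to the source
// store; writes are rejected while migration is in progress (the
// PREPARATION and COPYING phases).
class DualStore implements SimpleStore {
    private final SimpleStore source;
    private final AtomicBoolean migrationInProgress = new AtomicBoolean(false);

    DualStore(SimpleStore source) { this.source = source; }

    void enterReadOnlyMode() { migrationInProgress.set(true); }
    void exitReadOnlyMode() { migrationInProgress.set(false); }

    public Optional<byte[]> get(String path) { return source.get(path); }

    public void put(String path, byte[] value) {
        if (migrationInProgress.get()) {
            throw new IllegalStateException("metadata writes blocked during migration");
        }
        source.put(path, value);
    }
}
```

In the real framework the wrapper would also initialize the lazily-created target store and redirect traffic to it once the COMPLETED phase is reached; the sketch only shows the write-blocking behavior.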
+## Detailed Design
+
+### Migration Phases
+
+```
+NOT_STARTED
+ ↓
+PREPARATION ← All brokers/bookies recreate ephemeral nodes in target
+ ← Metadata writes are BLOCKED (read-only mode)
+ ↓
+COPYING ← Coordinator copies persistent data source → target
+ ← Metadata writes still BLOCKED
+ ↓
+COMPLETED ← Migration complete, all services using target store
+ ← Metadata writes ENABLED on target
+ ↓
+After validation period:
+ * Update config and restart brokers & bookies
+ * Decommission source store
+
+(If errors occur):
+FAILED ← Rollback to source store, writes ENABLED
+```
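The state machine in the diagram above could be modeled as a small enum. The constant names mirror the PIP's phases; the transition table and the write-blocking rule are inferred from the diagram, so treat this as an illustrative sketch rather than the proposed implementation.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative model of the migration state machine described in the PIP.
enum MigrationPhase {
    NOT_STARTED, PREPARATION, COPYING, COMPLETED, FAILED;

    // Legal next phases, per the diagram: any active phase may fail,
    // COMPLETED and FAILED are terminal.
    Set<MigrationPhase> next() {
        switch (this) {
            case NOT_STARTED: return EnumSet.of(PREPARATION);
            case PREPARATION: return EnumSet.of(COPYING, FAILED);
            case COPYING:     return EnumSet.of(COMPLETED, FAILED);
            default:          return EnumSet.noneOf(MigrationPhase.class);
        }
    }

    // Metadata writes are blocked only during PREPARATION and COPYING.
    boolean writesBlocked() {
        return this == PREPARATION || this == COPYING;
    }
}
```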
+
+### Phase 1: NOT_STARTED → PREPARATION
+
+**Participant Registration (at startup):**
+Each broker and bookie registers itself as a migration participant by creating
a sequential ephemeral node:
+- Path: `/pulsar/migration-coordinator/participants/id-NNNN` (sequential)
+- This allows the coordinator to know how many participants exist before
migration starts
+
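The sequential-registration step can be illustrated with a toy coordinator. In the real framework the sequential ephemeral nodes would be created through the metadata store API (and would disappear with the participant's session); here an in-memory counter stands in for that, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration of sequential participant registration under
// /pulsar/migration-coordinator/participants/. A real implementation
// would rely on the store's sequential+ephemeral node semantics.
class ParticipantRegistry {
    private static final String BASE = "/pulsar/migration-coordinator/participants/id-";
    private final AtomicInteger seq = new AtomicInteger();
    private final List<String> participants = new ArrayList<>();

    // Each broker/bookie calls this at startup; returns its assigned path.
    synchronized String register() {
        String path = BASE + String.format("%04d", seq.getAndIncrement());
        participants.add(path);
        return path;
    }

    // The coordinator counts participants before starting migration, so it
    // knows how many PREPARATION acknowledgements to wait for.
    synchronized int participantCount() { return participants.size(); }
}
```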
+**Administrator triggers migration:**
+```bash
+pulsar-admin metadata-migration start --target oxia://oxia1:6648
+```
+
+**Coordinator actions:**
+1. Creates migration flag in source store:
`/pulsar/migration-coordinator/migration`
+ ```json
+ {
+ "phase": "PREPARATION",
+ "targetUrl": "oxia://oxia1:6648"
+ }
+ ```
+
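The flag payload above is small enough to model directly. A dependency-free sketch (the class name is illustrative; the PIP does not prescribe a serialization library):

```java
// Illustrative representation of the migration flag payload written to
// /pulsar/migration-coordinator/migration. Plain string formatting keeps
// the sketch dependency-free; a real implementation would use a JSON library.
class MigrationFlag {
    final String phase;
    final String targetUrl;

    MigrationFlag(String phase, String targetUrl) {
        this.phase = phase;
        this.targetUrl = targetUrl;
    }

    String toJson() {
        return String.format("{\"phase\":\"%s\",\"targetUrl\":\"%s\"}", phase, targetUrl);
    }
}
```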
+**Broker/Bookie actions (automatic, triggered by watching the flag):**
+1. Detect migration flag via watch on `/pulsar/migration-coordinator/migration`
+2. Defer non-critical metadata writes (e.g., ledger rollovers, bundle
ownership changes)
Review Comment:
Would this use the existing solution? A
`SessionEvent.ConnectionLost`/`SessionEvent.SessionLost` event sets a flag
`metadataServiceAvailable` that is already used for this purpose in many locations.
https://github.com/apache/pulsar/blob/d630394cdd02792b2dbc3a55443637a5d593a137/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerFactoryImpl.java#L148-L152
https://github.com/apache/pulsar/blob/1617bb22173a117f24d47ac6f11cc2f7c68de635/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerFactoryImpl.java#L288-L291
It seems that currently ledger trimming, ledger rollover and load balancer
load shedding use the `metadataServiceAvailable` flag in
`ManagedLedgerFactoryImpl`.
There's also a direct dependency on the event:
https://github.com/apache/pulsar/blob/38807b1511ba3b8c150d69c16a0c3ae36f321dac/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/ModularLoadManagerImpl.java#L1137-L1150
**Would the coordinator send a `SessionEvent.ConnectionLost` event when
migration starts, so that it remains compatible with the existing solution?**
`AbstractMetadataStore` also has an `isConnected` flag which could be useful;
it's not currently used within Pulsar beyond metadata store caching
decisions. I guess it would be necessary to skip cache refreshes while the
migration is ongoing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]