lhotari commented on code in PR #25196:
URL: https://github.com/apache/pulsar/pull/25196#discussion_r2745453477
##########
pip/pip-454.md:
##########
@@ -0,0 +1,416 @@
+# PIP-454: Metadata Store Migration Framework
+
+## Motivation
+
+Apache Pulsar currently uses Apache ZooKeeper as its metadata store for broker
coordination, topic metadata, namespace policies, and BookKeeper ledger
management. While ZooKeeper has served well, there are several motivations for
enabling migration to alternative metadata stores:
+
+1. **Operational Simplicity**: Alternative metadata stores like Oxia may offer
simpler operations, better observability, or reduced operational overhead
compared to ZooKeeper ensembles.
+
+2. **Performance Characteristics**: Different metadata stores have different
performance profiles. Some workloads may benefit from stores optimized for high
throughput or low latency.
+
+3. **Deployment Flexibility**: Organizations may prefer metadata stores that
align better with their existing infrastructure and expertise.
+
+4. **Zero-Downtime Migration**: Operators need a safe, automated way to
migrate metadata between stores without service interruption.
+
+Currently, there is no supported path to migrate from one metadata store to
another without cluster downtime. This PIP proposes a **safe, simple migration
framework** that ensures metadata consistency by avoiding complex
dual-write/dual-read patterns. The framework enables:
+
+- **Zero-downtime migration** from any metadata store to any other supported
store
+- **Automatic ephemeral node recreation** in the target store
+- **Version preservation** to ensure conditional writes continue working
+- **Automatic failure recovery** if issues are detected
+- **Minimal configuration changes**: no config updates needed until after migration completes
+
+## Goal
+
+Provide a safe, automated framework for migrating Apache Pulsar's metadata
from one store implementation (e.g., ZooKeeper) to another (e.g., Oxia) with
zero service interruption.
+
+### In Scope
+
+- Migration framework supporting any source → any target metadata store
+- Automatic ephemeral node recreation by brokers and bookies
+- Persistent data copy with version preservation
+- CLI commands for migration control and monitoring
+- Automatic failure recovery during migration
+- Support for broker and bookie participation
+- Read-only mode during migration for consistency
+
+### Out of Scope
+
+- Developing new metadata store implementations (Oxia, Etcd support already
exists)
+- Cross-cluster metadata synchronization (different use case)
+- Automated rollback after COMPLETED phase (requires manual intervention)
+- Migration of configuration metadata store (can be done separately)
+
+## High Level Design
+
+The migration framework introduces a **DualMetadataStore** wrapper that
transparently handles migration without modifying existing metadata store
implementations.
+
+### Key Principles
+
+1. **Transparent Wrapping**: The `DualMetadataStore` wraps the existing source
store (e.g., `ZKMetadataStore`) without modifying its implementation.
+
+2. **Lazy Target Initialization**: The target store is only initialized when
migration begins, triggered by a flag in the source store.
+
+3. **Ephemeral-First Approach**: Before copying persistent data, all brokers
and bookies recreate their ephemeral nodes in the target store. This ensures
the cluster is "live" in both stores during migration.
+
+4. **Read-Only Mode During Migration**: To ensure consistency, all metadata
writes are blocked during PREPARATION and COPYING phases. Components receive
`SessionLost` events to defer non-critical operations (e.g., ledger rollovers).
+
+5. **Phase-Based Migration**: Migration proceeds through well-defined phases
(PREPARATION → COPYING → COMPLETED).
+
+6. **Generic Framework**: The framework is agnostic to specific store
implementations; it works with any source and target that implement the
`MetadataStore` interface.
+
+7. **Guaranteed Consistency**: By blocking writes during migration and using
atomic copy, metadata is **always in a consistent state**. No dual-write
complexity, no data divergence, no consistency issues.
+
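The wrapping-plus-read-only idea above can be sketched in a few lines. This is a minimal, self-contained illustration, not the actual Pulsar `MetadataStore` interface: `SimpleStore`, `InMemoryStore`, and `DualStore` are stand-in names, and the real implementation would work asynchronously with `CompletableFuture`s.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative stand-in for the real MetadataStore interface.
interface SimpleStore {
    Optional<byte[]> get(String path);
    void put(String path, byte[] value);
}

class InMemoryStore implements SimpleStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();
    public Optional<byte[]> get(String path) { return Optional.ofNullable(data.get(path)); }
    public void put(String path, byte[] value) { data.put(path, value); }
}

// Sketch of the DualMetadataStore idea: reads pass through to the source
// store; writes are rejected while migration is in progress (the
// PREPARATION and COPYING phases).
class DualStore implements SimpleStore {
    private final SimpleStore source;
    private final AtomicBoolean migrationInProgress = new AtomicBoolean(false);

    DualStore(SimpleStore source) { this.source = source; }

    void enterReadOnlyMode() { migrationInProgress.set(true); }
    void exitReadOnlyMode() { migrationInProgress.set(false); }

    public Optional<byte[]> get(String path) { return source.get(path); }

    public void put(String path, byte[] value) {
        if (migrationInProgress.get()) {
            throw new IllegalStateException("metadata writes blocked during migration");
        }
        source.put(path, value);
    }
}
```

In the real framework the wrapper would also initialize the lazily-created target store and redirect traffic to it once the COMPLETED phase is reached; the sketch only shows the write-blocking behavior.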
+## Detailed Design
+
+### Migration Phases
+
+```
+NOT_STARTED
+ ↓
+PREPARATION ← All brokers/bookies recreate ephemeral nodes in target
+ ← Metadata writes are BLOCKED (read-only mode)
+ ↓
+COPYING ← Coordinator copies persistent data source → target
+ ← Metadata writes still BLOCKED
+ ↓
+COMPLETED ← Migration complete, all services using target store
+ ← Metadata writes ENABLED on target
+ ↓
+After validation period:
+ * Update config and restart brokers & bookies
+ * Decommission source store
+
+(If errors occur):
+FAILED ← Rollback to source store, writes ENABLED
+```
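The state machine in the diagram above could be modeled as a small enum. The constant names mirror the PIP's phases; the transition table and the write-blocking rule are inferred from the diagram, so treat this as an illustrative sketch rather than the proposed implementation.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative model of the migration state machine described in the PIP.
enum MigrationPhase {
    NOT_STARTED, PREPARATION, COPYING, COMPLETED, FAILED;

    // Legal next phases, per the diagram: any active phase may fail,
    // COMPLETED and FAILED are terminal.
    Set<MigrationPhase> next() {
        switch (this) {
            case NOT_STARTED: return EnumSet.of(PREPARATION);
            case PREPARATION: return EnumSet.of(COPYING, FAILED);
            case COPYING:     return EnumSet.of(COMPLETED, FAILED);
            default:          return EnumSet.noneOf(MigrationPhase.class);
        }
    }

    // Metadata writes are blocked only during PREPARATION and COPYING.
    boolean writesBlocked() {
        return this == PREPARATION || this == COPYING;
    }
}
```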
+
+### Phase 1: NOT_STARTED → PREPARATION
+
+**Participant Registration (at startup):**
+Each broker and bookie registers itself as a migration participant by creating
a sequential ephemeral node:
+- Path: `/pulsar/migration-coordinator/participants/id-NNNN` (sequential)
+- This allows the coordinator to know how many participants exist before
migration starts
+
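The sequential-registration step can be illustrated with a toy coordinator. In the real framework the sequential ephemeral nodes would be created through the metadata store API (and would disappear with the participant's session); here an in-memory counter stands in for that, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration of sequential participant registration under
// /pulsar/migration-coordinator/participants/. A real implementation
// would rely on the store's sequential+ephemeral node semantics.
class ParticipantRegistry {
    private static final String BASE = "/pulsar/migration-coordinator/participants/id-";
    private final AtomicInteger seq = new AtomicInteger();
    private final List<String> participants = new ArrayList<>();

    // Each broker/bookie calls this at startup; returns its assigned path.
    synchronized String register() {
        String path = BASE + String.format("%04d", seq.getAndIncrement());
        participants.add(path);
        return path;
    }

    // The coordinator counts participants before starting migration, so it
    // knows how many PREPARATION acknowledgements to wait for.
    synchronized int participantCount() { return participants.size(); }
}
```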
+**Administrator triggers migration:**
+```bash
+pulsar-admin metadata-migration start --target oxia://oxia1:6648
+```
+
+**Coordinator actions:**
+1. Creates migration flag in source store:
`/pulsar/migration-coordinator/migration`
+ ```json
+ {
+ "phase": "PREPARATION",
+ "targetUrl": "oxia://oxia1:6648"
+ }
+ ```
+
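The flag payload above is small enough to model directly. A dependency-free sketch (the class name is illustrative; the PIP does not prescribe a serialization library):

```java
// Illustrative representation of the migration flag payload written to
// /pulsar/migration-coordinator/migration. Plain string formatting keeps
// the sketch dependency-free; a real implementation would use a JSON library.
class MigrationFlag {
    final String phase;
    final String targetUrl;

    MigrationFlag(String phase, String targetUrl) {
        this.phase = phase;
        this.targetUrl = targetUrl;
    }

    String toJson() {
        return String.format("{\"phase\":\"%s\",\"targetUrl\":\"%s\"}", phase, targetUrl);
    }
}
```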
+**Broker/Bookie actions (automatic, triggered by watching the flag):**
+1. Detect migration flag via watch on `/pulsar/migration-coordinator/migration`
+2. Defer non-critical metadata writes (e.g., ledger rollovers, bundle
ownership changes)
Review Comment:
Would this use the existing solution? A
`SessionEvent.ConnectionLost`/`SessionEvent.SessionLost` event sets a flag
`metadataServiceAvailable` that is already used for this purpose in many locations.
https://github.com/apache/pulsar/blob/d630394cdd02792b2dbc3a55443637a5d593a137/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerFactoryImpl.java#L148-L152
https://github.com/apache/pulsar/blob/1617bb22173a117f24d47ac6f11cc2f7c68de635/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerFactoryImpl.java#L288-L291
It seems that currently ledger trimming, ledger rollover and load balancer
load shedding use the `metadataServiceAvailable` flag in
`ManagedLedgerFactoryImpl`.
There's also a direct dependency on the event:
https://github.com/apache/pulsar/blob/38807b1511ba3b8c150d69c16a0c3ae36f321dac/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/ModularLoadManagerImpl.java#L1137-L1150
**Would the coordinator send a `SessionEvent.ConnectionLost` event when
migration starts, so that it remains compatible with the existing solution?**
`AbstractMetadataStore` also has an `isConnected` flag which could be useful;
it's not currently used within Pulsar beyond metadata store caching
decisions. I guess it would be necessary to skip cache refreshes while the
migration is ongoing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]