lhotari commented on code in PR #25196:
URL: https://github.com/apache/pulsar/pull/25196#discussion_r2745469752
##########
pip/pip-454.md:
##########
@@ -0,0 +1,416 @@

# PIP-454: Metadata Store Migration Framework

## Motivation

Apache Pulsar currently uses Apache ZooKeeper as its metadata store for broker coordination, topic metadata, namespace policies, and BookKeeper ledger management. While ZooKeeper has served well, there are several motivations for enabling migration to alternative metadata stores:

1. **Operational Simplicity**: Alternative metadata stores like Oxia may offer simpler operations, better observability, or reduced operational overhead compared to ZooKeeper ensembles.

2. **Performance Characteristics**: Different metadata stores have different performance profiles. Some workloads may benefit from stores optimized for high throughput or low latency.

3. **Deployment Flexibility**: Organizations may prefer metadata stores that align better with their existing infrastructure and expertise.

4. **Zero-Downtime Migration**: Operators need a safe, automated way to migrate metadata between stores without service interruption.

Currently, there is no supported path to migrate from one metadata store to another without cluster downtime. This PIP proposes a **safe, simple migration framework** that ensures metadata consistency by avoiding complex dual-write/dual-read patterns. The framework enables:

- **Zero-downtime migration** from any metadata store to any other supported store
- **Automatic ephemeral node recreation** in the target store
- **Version preservation** to ensure conditional writes continue working
- **Automatic failure recovery** if issues are detected
- **Minimal configuration changes** - no config updates needed until after migration completes

## Goal

Provide a safe, automated framework for migrating Apache Pulsar's metadata from one store implementation (e.g., ZooKeeper) to another (e.g., Oxia) with zero service interruption.

### In Scope

- Migration framework supporting any source → any target metadata store
- Automatic ephemeral node recreation by brokers and bookies
- Persistent data copy with version preservation
- CLI commands for migration control and monitoring
- Automatic failure recovery during migration
- Support for broker and bookie participation
- Read-only mode during migration for consistency

### Out of Scope

- Developing new metadata store implementations (Oxia, Etcd support already exists)
- Cross-cluster metadata synchronization (different use case)
- Automated rollback after COMPLETED phase (requires manual intervention)
- Migration of configuration metadata store (can be done separately)

## High Level Design

The migration framework introduces a **DualMetadataStore** wrapper that transparently handles migration without modifying existing metadata store implementations.

### Key Principles

1. **Transparent Wrapping**: The `DualMetadataStore` wraps the existing source store (e.g., `ZKMetadataStore`) without modifying its implementation.

2. **Lazy Target Initialization**: The target store is only initialized when migration begins, triggered by a flag in the source store.

3. **Ephemeral-First Approach**: Before copying persistent data, all brokers and bookies recreate their ephemeral nodes in the target store. This ensures the cluster is "live" in both stores during migration.

4. **Read-Only Mode During Migration**: To ensure consistency, all metadata writes are blocked during PREPARATION and COPYING phases. Components receive `SessionLost` events to defer non-critical operations (e.g., ledger rollovers).

5. **Phase-Based Migration**: Migration proceeds through well-defined phases (PREPARATION → COPYING → COMPLETED).

6. **Generic Framework**: The framework is agnostic to specific store implementations - it works with any source and target that implement the `MetadataStore` interface.

7. **Guaranteed Consistency**: By blocking writes during migration and using atomic copy, metadata is **always in a consistent state**. No dual-write complexity, no data divergence, no consistency issues.
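To make the wrapping and write-blocking behaviour above concrete, here is a minimal, illustrative Java sketch (editorial illustration, not part of the PR diff). It uses a simplified stand-in interface rather than Pulsar's actual `MetadataStore` API, and the names `SimpleMetadataStore`, `MigrationPhase`, `DualMetadataStoreSketch`, and `onPhaseChange` are hypothetical: reads and writes delegate to the source store until the migration reaches COMPLETED, writes are rejected during PREPARATION and COPYING, and everything switches to the target store afterwards.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for a metadata store API; the real Pulsar
// MetadataStore interface has more operations (children, delete, notifications, ...).
interface SimpleMetadataStore {
    CompletableFuture<Optional<byte[]>> get(String path);
    CompletableFuture<Long> put(String path, byte[] value, Optional<Long> expectedVersion);
}

enum MigrationPhase { NOT_STARTED, PREPARATION, COPYING, COMPLETED, FAILED }

/**
 * Illustrative wrapper: operations go to the source store until the migration
 * reaches COMPLETED, after which they go to the target store. Writes are
 * rejected while the migration is in PREPARATION or COPYING, which is one way
 * the read-only window described above could be enforced.
 */
class DualMetadataStoreSketch implements SimpleMetadataStore {
    private final SimpleMetadataStore source;
    private final SimpleMetadataStore target;
    private final AtomicReference<MigrationPhase> phase =
            new AtomicReference<>(MigrationPhase.NOT_STARTED);

    DualMetadataStoreSketch(SimpleMetadataStore source, SimpleMetadataStore target) {
        this.source = source;
        this.target = target;
    }

    // Driven by the coordinator's phase updates (hypothetical hook).
    void onPhaseChange(MigrationPhase newPhase) {
        phase.set(newPhase);
    }

    private SimpleMetadataStore active() {
        // NOT_STARTED, PREPARATION, COPYING, and FAILED all read from the source store.
        return phase.get() == MigrationPhase.COMPLETED ? target : source;
    }

    @Override
    public CompletableFuture<Optional<byte[]>> get(String path) {
        return active().get(path);
    }

    @Override
    public CompletableFuture<Long> put(String path, byte[] value, Optional<Long> expectedVersion) {
        MigrationPhase p = phase.get();
        if (p == MigrationPhase.PREPARATION || p == MigrationPhase.COPYING) {
            // Metadata is read-only while ephemeral nodes are recreated and
            // persistent data is copied; callers are expected to defer and retry.
            return CompletableFuture.failedFuture(
                    new IllegalStateException("Metadata store is read-only during migration"));
        }
        return active().put(path, value, expectedVersion);
    }
}
```

Keeping the phase check inside a single wrapper is what would let existing callers and store implementations stay untouched, in line with Key Principle 1.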
## Detailed Design

### Migration Phases

```
NOT_STARTED
    ↓
PREPARATION  ← All brokers/bookies recreate ephemeral nodes in target
             ← Metadata writes are BLOCKED (read-only mode)
    ↓
COPYING      ← Coordinator copies persistent data source → target
             ← Metadata writes still BLOCKED
    ↓
COMPLETED    ← Migration complete, all services using target store
             ← Metadata writes ENABLED on target
    ↓
After validation period:
  * Update config and restart brokers & bookies
  * Decommission source store

(If errors occur):
FAILED       ← Rollback to source store, writes ENABLED
```

### Phase 1: NOT_STARTED → PREPARATION

**Participant Registration (at startup):**
Each broker and bookie registers itself as a migration participant by creating a sequential ephemeral node:
- Path: `/pulsar/migration-coordinator/participants/id-NNNN` (sequential)
- This allows the coordinator to know how many participants exist before migration starts

Review Comment:
   Which node is selected as the migration coordinator? If it's the broker, what if the migration takes a lot more memory than the broker usually does and causes an OOME? Would it be possible to deploy a dedicated coordinator or run the coordinator in-process, let's say in a pod with sufficient resources, running in a Pulsar cluster?
