Thanks for voting, closing this PIP vote.

3 binding +1s:
* Lari
* Qiang
* Matteo
1 non-binding +1:
* Tao

Thanks,
Matteo
--
Matteo Merli
<[email protected]>

On Tue, Feb 24, 2026 at 10:04 AM mattison chao <[email protected]> wrote:
> +1 (binding)

On Sun, 8 Feb 2026 at 17:17, Tao Jiuming <[email protected]> wrote:
> +1 nonbinding

On Fri, Feb 6, 2026 at 01:43, Matteo Merli <[email protected]> wrote:

PIP PR: https://github.com/apache/pulsar/pull/25196

PR with implementation: https://github.com/apache/pulsar/pull/25219

----

# PIP-454: Metadata Store Migration Framework

## Motivation

Apache Pulsar currently uses Apache ZooKeeper as its metadata store for broker coordination, topic metadata, namespace policies, and BookKeeper ledger management. While ZooKeeper has served well, there are several motivations for enabling migration to alternative metadata stores:

1. **Operational Simplicity**: Alternative metadata stores like Oxia may offer simpler operations, better observability, or reduced operational overhead compared to ZooKeeper ensembles.

2. **Performance Characteristics**: Different metadata stores have different performance profiles. Some workloads may benefit from stores optimized for high throughput or low latency.

3. **Deployment Flexibility**: Organizations may prefer metadata stores that align better with their existing infrastructure and expertise.

4. **Zero-Downtime Migration**: Operators need a safe, automated way to migrate metadata between stores without service interruption.

Currently, there is no supported path to migrate from one metadata store to another without cluster downtime. This PIP proposes a **safe, simple migration framework** that ensures metadata consistency by avoiding complex dual-write/dual-read patterns. The framework enables:

- **Zero-downtime migration** from any metadata store to any other supported store
- **Automatic ephemeral node recreation** in the target store
- **Version preservation** to ensure conditional writes continue working
- **Automatic failure recovery** if issues are detected
- **Minimal configuration changes** - no config updates needed until after migration completes

## Goal

Provide a safe, automated framework for migrating Apache Pulsar's metadata from one store implementation (e.g., ZooKeeper) to another (e.g., Oxia) with zero service interruption.
### In Scope

- Migration framework supporting any source → any target metadata store
- Automatic ephemeral node recreation by brokers and bookies
- Persistent data copy with version preservation
- CLI commands for migration control and monitoring
- Automatic failure recovery during migration
- Support for broker and bookie participation
- Read-only mode during migration for consistency

### Out of Scope

- Developing new metadata store implementations (Oxia and Etcd support already exists)
- Cross-cluster metadata synchronization (different use case)
- Automated rollback after the COMPLETED phase (requires manual intervention)
- Migration of the configuration metadata store and geo-replicated clusters (can be done separately)

## High Level Design

The migration framework introduces a **DualMetadataStore** wrapper that transparently handles migration without modifying existing metadata store implementations.

### Key Principles

1. **Transparent Wrapping**: The `DualMetadataStore` wraps the existing source store (e.g., `ZKMetadataStore`) without modifying its implementation (see the sketch after this list).

2. **Lazy Target Initialization**: The target store is only initialized when migration begins, triggered by a flag in the source store.

3. **Ephemeral-First Approach**: Before copying persistent data, all brokers and bookies recreate their ephemeral nodes in the target store. This ensures the cluster is "live" in both stores during migration.

4. **Read-Only Mode During Migration**: To ensure consistency, all metadata writes are blocked during the PREPARATION and COPYING phases. Components receive `SessionLost` events to defer non-critical operations (e.g., ledger rollovers).

5. **Phase-Based Migration**: Migration proceeds through well-defined phases (PREPARATION → COPYING → COMPLETED).

6. **Generic Framework**: The framework is agnostic to specific store implementations - it works with any source and target that implement the `MetadataStore` interface.

7. **Guaranteed Consistency**: By blocking writes during migration and using an atomic copy, metadata is **always in a consistent state**. No dual-write complexity, no data divergence, no consistency issues.
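The wrapping idea can be illustrated with a minimal sketch. The `SimpleStore` interface, `MigrationPhase` enum, and `DualStoreSketch` class below are hypothetical simplifications for illustration only; the actual `DualMetadataStore` in the implementation PR wraps Pulsar's full `MetadataStore` interface.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified view of a metadata store; real Pulsar code uses
// the much richer org.apache.pulsar.metadata.api.MetadataStore interface.
interface SimpleStore {
    CompletableFuture<Optional<byte[]>> get(String path);
    CompletableFuture<Long> put(String path, byte[] value, Optional<Long> expectedVersion);
}

enum MigrationPhase { NOT_STARTED, PREPARATION, COPYING, COMPLETED, FAILED }

/**
 * Sketch of the DualMetadataStore idea: delegate to the source store until the
 * COMPLETED phase, block writes while a migration is in progress, and switch
 * reads and writes to the target store once the migration has completed.
 */
class DualStoreSketch implements SimpleStore {
    private final SimpleStore source;
    private final AtomicReference<SimpleStore> target = new AtomicReference<>();
    private final AtomicReference<MigrationPhase> phase =
            new AtomicReference<>(MigrationPhase.NOT_STARTED);

    DualStoreSketch(SimpleStore source) {
        this.source = source;
    }

    // Called when the watch on the migration flag fires (lazy target initialization).
    void onPhaseChange(MigrationPhase newPhase, SimpleStore targetStore) {
        target.set(targetStore);
        phase.set(newPhase);
    }

    private SimpleStore active() {
        // Single source of truth: target only after COMPLETED, source otherwise.
        return phase.get() == MigrationPhase.COMPLETED ? target.get() : source;
    }

    @Override
    public CompletableFuture<Optional<byte[]>> get(String path) {
        return active().get(path); // reads are always allowed
    }

    @Override
    public CompletableFuture<Long> put(String path, byte[] value, Optional<Long> expectedVersion) {
        MigrationPhase p = phase.get();
        if (p == MigrationPhase.PREPARATION || p == MigrationPhase.COPYING) {
            // Read-only mode: metadata writes fail fast while the migration is running.
            return CompletableFuture.failedFuture(
                    new IllegalStateException("Metadata store is read-only during migration"));
        }
        return active().put(path, value, expectedVersion);
    }
}
```

Because the wrapper exposes the same interface as the wrapped store, brokers and bookies need no code changes to participate in the migration.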
## Detailed Design

### Migration Phases

```
NOT_STARTED
    ↓
PREPARATION  ← All brokers/bookies recreate ephemeral nodes in target
             ← Metadata writes are BLOCKED (read-only mode)
    ↓
COPYING      ← Coordinator copies persistent data source → target
             ← Metadata writes still BLOCKED
    ↓
COMPLETED    ← Migration complete, all services using target store
             ← Metadata writes ENABLED on target
    ↓
After validation period:
  * Update config and restart brokers & bookies
  * Decommission source store

(If errors occur):
FAILED       ← Rollback to source store, writes ENABLED
```

### Phase 1: NOT_STARTED → PREPARATION

**Participant registration (at startup):**
Each broker and bookie registers itself as a migration participant by creating a sequential ephemeral node:
- Path: `/pulsar/migration-coordinator/participants/id-NNNN` (sequential)
- This allows the coordinator to know how many participants exist before migration starts

**Administrator triggers migration:**
```bash
pulsar-admin metadata-migration start --target oxia://oxia1:6648
```

**Coordinator actions:**
1. Creates the migration flag in the source store at `/pulsar/migration-coordinator/migration`:
```json
{
  "phase": "PREPARATION",
  "targetUrl": "oxia://oxia1:6648"
}
```

**Broker/bookie actions (automatic, triggered by watching the flag; see the sketch after this phase description):**
1. Detect the migration flag via a watch on `/pulsar/migration-coordinator/migration`
2. Defer non-critical metadata writes (e.g., ledger rollovers, bundle ownership changes)
3. Initialize a connection to the target store
4. Recreate ALL ephemeral nodes in the target store
5. **Delete** the participant registration node to signal "ready"

**The coordinator waits for all participant nodes to be deleted (indicating all participants are ready).**
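To make the participant side concrete, here is a rough sketch of how a broker or bookie might register itself and react to the migration flag. It assumes Pulsar's `MetadataStoreExtended` API (`put` with `CreateOption`s, `registerListener`); the class name, helper methods, and string-based phase check are illustrative and not taken from the implementation PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import java.util.Optional;
import org.apache.pulsar.metadata.api.Notification;
import org.apache.pulsar.metadata.api.extended.CreateOption;
import org.apache.pulsar.metadata.api.extended.MetadataStoreExtended;

/** Illustrative participant logic; names and error handling are simplified. */
class MigrationParticipantSketch {
    private static final String FLAG = "/pulsar/migration-coordinator/migration";
    private static final String PARTICIPANTS = "/pulsar/migration-coordinator/participants/id-";

    private final MetadataStoreExtended sourceStore;
    private volatile String participantPath;

    MigrationParticipantSketch(MetadataStoreExtended sourceStore) {
        this.sourceStore = sourceStore;
    }

    /** At startup: create a sequential ephemeral node so the coordinator can count us. */
    void registerAtStartup() {
        sourceStore.put(PARTICIPANTS, new byte[0], Optional.empty(),
                        EnumSet.of(CreateOption.Ephemeral, CreateOption.Sequential))
                .thenAccept(stat -> participantPath = stat.getPath());

        // Watch the migration flag; the listener fires when the flag node changes.
        sourceStore.registerListener(this::handleNotification);
    }

    private void handleNotification(Notification n) {
        if (!FLAG.equals(n.getPath())) {
            return;
        }
        sourceStore.get(FLAG).thenAccept(optRes -> optRes.ifPresent(res -> {
            String json = new String(res.getValue(), StandardCharsets.UTF_8);
            if (json.contains("\"PREPARATION\"")) {   // real code would parse the JSON
                onPreparation();
            }
        }));
    }

    private void onPreparation() {
        // 1. Defer non-critical metadata writes (ledger rollovers, bundle unloads, ...)
        // 2. Connect to the target store and recreate this node's ephemeral entries
        // 3. Delete the participant node to tell the coordinator we are ready.
        sourceStore.delete(participantPath, Optional.empty());
    }
}
```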
### Phase 2: PREPARATION → COPYING

**Coordinator actions:**
1. Updates the phase to `COPYING`
2. Performs a recursive copy of persistent data from source → target:
   - Skips ephemeral nodes (already recreated)
   - Concurrent operations limited by a semaphore (default: 1000 pending ops)
   - Breadth-first traversal to process all paths
   - Progress logged periodically

**During this phase:**
- Metadata writes are BLOCKED (return an error to clients)
- Metadata reads continue normally from the source store
- Ephemeral nodes remain alive in both stores
- **Data plane operations unaffected**: publish/consume and ledger writes continue normally
- Version-id and modification count are preserved using the direct Oxia client
- Breadth-first traversal with at most 1000 concurrent operations

**Estimated duration:**
- **< 30 seconds** for typical deployments with up to **500 MB of metadata** in ZooKeeper

**Impact on operations:**
- ✅ Existing topics: publish and consume continue without interruption
- ✅ BookKeeper: ledger writes and reads continue normally
- ✅ Clients: connected producers and consumers are unaffected
- ❌ Admin operations: topic/namespace creation is blocked temporarily
- ❌ Bundle operations: load balancing is deferred until completion

### Phase 3: COPYING → COMPLETED

**Coordinator actions:**
1. Updates the phase to `COMPLETED`
2. Logs a success message with the total copied node count

**Broker/bookie actions (automatic, triggered by the phase update):**
1. Detect the `COMPLETED` phase
2. Deferred operations can now proceed
3. Switch routing:
   - **Writes**: go to the target store only
   - **Reads**: go to the target store only

**At this point:**
- The cluster is running on the target store
- The source store remains available for safety
- Metadata writes are enabled again

**Operator follow-up (after the validation period):**
1. Update configuration files:
```properties
# Before (ZooKeeper):
metadataStoreUrl=zk://zk1:2181,zk2:2181/pulsar

# After (Oxia):
metadataStoreUrl=oxia://oxia1:6648
```
2. Perform a rolling restart with the new config
3. After all services have restarted, decommission the source store

### Failure Handling: ANY_PHASE → FAILED

**If migration fails at any point:**
1. The coordinator updates the phase to `FAILED`
2. Broker/bookie actions:
   - Detect the `FAILED` phase
   - Discard the target store connection
   - Continue using the source store
   - Metadata writes are enabled again

**Operator actions:**
1. Review the logs to understand the failure cause
2. Fix the underlying issue
3. Retry the migration with `pulsar-admin metadata-migration start --target <url>`

## Implementation Details

### Key Implementation Details

1. **Direct Oxia Client Usage**: The coordinator uses `AsyncOxiaClient` directly instead of going through the `MetadataStore` interface. This allows setting the version-id and modification count to match the source values, ensuring conditional writes (compare-and-set operations) continue to work correctly after migration.

2. **Breadth-First Traversal**: Processes paths level by level using a work queue, enabling high concurrency while preventing deep recursion.

3. **Concurrent Operations**: Uses a semaphore to limit pending operations (default: 1000), balancing throughput with memory usage (see the sketch below).
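The breadth-first copy with a bounded number of in-flight operations could look roughly like the sketch below. It assumes Pulsar's `MetadataStore` read API for the source; `writeToTarget` is a placeholder for the direct `AsyncOxiaClient` write that preserves the version-id and modification count, and the class is illustrative rather than the implementation PR's code.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import org.apache.pulsar.metadata.api.GetResult;
import org.apache.pulsar.metadata.api.MetadataStore;

/** Illustrative coordinator copy loop; not the code from the implementation PR. */
abstract class MetadataCopySketch {
    private final MetadataStore source;
    private final Semaphore pendingOps = new Semaphore(1000); // default concurrency limit

    MetadataCopySketch(MetadataStore source) {
        this.source = source;
    }

    /**
     * Placeholder for the target write. The real coordinator uses AsyncOxiaClient
     * directly so it can preserve the source version-id and modification count.
     */
    protected abstract CompletableFuture<Void> writeToTarget(String path, GetResult sourceValue);

    /** Breadth-first traversal: process paths level by level using a work queue. */
    void copyAll(String rootPath) throws InterruptedException {
        Queue<String> queue = new ArrayDeque<>();
        queue.add(rootPath);

        while (!queue.isEmpty()) {
            String path = queue.poll();

            pendingOps.acquire(); // bound the number of in-flight operations
            source.get(path)
                    .thenCompose(opt -> {
                        if (opt.isPresent() && !opt.get().getStat().isEphemeral()) {
                            // Ephemeral nodes are skipped: their owners already recreated them.
                            return writeToTarget(path, opt.get());
                        }
                        return CompletableFuture.<Void>completedFuture(null);
                    })
                    .whenComplete((v, ex) -> pendingOps.release());

            // Enqueue children so the next level is processed after this one
            // (children are listed synchronously here to keep the sketch short).
            source.getChildren(path).join()
                    .forEach(child -> queue.add(path.equals("/") ? "/" + child : path + "/" + child));
        }
    }
}
```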
### Data Structures

**Migration state** (`/pulsar/migration-coordinator/migration`):
```json
{
  "phase": "PREPARATION",
  "targetUrl": "oxia://oxia1:6648/default"
}
```

Fields:
- `phase`: Current migration phase (NOT_STARTED, PREPARATION, COPYING, COMPLETED, FAILED)
- `targetUrl`: Target metadata store URL (e.g., `oxia://oxia1:6648/default`)

**Participant registration** (`/pulsar/migration-coordinator/participants/id-NNNN`):
- Sequential ephemeral node created by each broker/bookie at startup
- Empty data (presence indicates participation)
- Deleted by the participant when preparation is complete (signals "ready")
- The coordinator waits for all of them to be deleted before proceeding to the COPYING phase

**No additional state tracking**: The simplified design removes complex state tracking and checksums. Migration state is kept minimal.

### CLI Commands

```bash
# Start migration
pulsar-admin metadata-migration start --target <target-url>

# Check status
pulsar-admin metadata-migration status
```

The simplified design only requires two commands. Rollback happens automatically if migration fails (the phase transitions to FAILED).

### REST API

```
POST /admin/v2/metadata/migration/start
Body: { "targetUrl": "oxia://..." }

GET /admin/v2/metadata/migration/status
Returns: { "phase": "COPYING", "targetUrl": "oxia://..." }
```
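For automation, the proposed status endpoint can be polled until the migration reaches a terminal phase. Below is a minimal sketch using the JDK HTTP client; the broker address is a placeholder and the JSON handling is deliberately simplified.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative polling of the proposed status endpoint; URL and parsing are simplified. */
public class MigrationStatusPoller {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://broker-1:8080/admin/v2/metadata/migration/status"))
                .GET()
                .build();

        while (true) {
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println("Migration status: " + body);
            // Real code would parse the JSON; here we just look for terminal phases.
            if (body.contains("COMPLETED") || body.contains("FAILED")) {
                break;
            }
            Thread.sleep(5_000); // poll every 5 seconds
        }
    }
}
```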
## Safety Guarantees

### Why This Approach is Safe

**The migration design guarantees metadata consistency by avoiding dual-write and dual-read patterns entirely:**

1. **Single Source of Truth**: At any given time, there is exactly ONE active metadata store:
   - Before migration: source store (ZooKeeper)
   - During PREPARATION and COPYING: source store (read-only)
   - After COMPLETED: target store (Oxia)

2. **No Dual-Write Complexity**: Unlike approaches that write to both stores simultaneously, this design eliminates:
   - Write synchronization issues
   - Conflict resolution between stores
   - Data divergence problems
   - Partial failure handling complexity

3. **No Dual-Read Complexity**: Unlike approaches that read from both stores, this design eliminates:
   - Read consistency issues
   - Cache invalidation across stores
   - Stale data problems
   - Complex fallback logic

4. **Atomic Cutover**: All participants switch stores simultaneously when the COMPLETED phase is detected. There is no ambiguous state where some participants use one store and others use another.

5. **Fast Migration Window**: With **< 30 seconds** for typical metadata sizes (even up to 500 MB), the read-only window is minimal and acceptable for most production environments.

**Bottom line**: Metadata is **always in a consistent state** - either fully in the source store or fully in the target store, never split or diverged between them.

### Data Integrity

1. **Version Preservation**: All persistent data is copied with the original version-id and modification count preserved. This ensures conditional writes (compare-and-set operations) continue working after migration.

2. **Ephemeral Node Recreation**: All ephemeral nodes are recreated by their owning brokers/bookies before the persistent data copy begins.

3. **Read-Only Mode**: All metadata writes are blocked during the PREPARATION and COPYING phases, ensuring no data inconsistencies during migration.

**Important**: Read-only mode only affects metadata operations. Data plane operations continue normally:
- ✅ **Publishing and consuming messages** works without interruption
- ✅ **Reading from existing topics and subscriptions** works normally
- ✅ **Ledger writes to BookKeeper** continue unaffected
- ❌ **Creating new topics or subscriptions** will be blocked temporarily
- ❌ **Namespace/policy updates** will be blocked temporarily
- ❌ **Bundle ownership changes** will be deferred until migration completes

### Operational Safety

1. **No Downtime**: Brokers and bookies remain online throughout the migration. **Data plane operations (publish/consume) continue without interruption.** Only metadata operations are temporarily blocked during the migration phases.

2. **Graceful Failure**: If migration fails at any point, the phase transitions to FAILED and the cluster returns to the source store automatically.

3. **Session Events**: Components receive a `SessionLost` event during migration to defer non-critical writes (e.g., ledger rollovers), and `SessionReestablished` when migration completes or fails (see the sketch after this list).

4. **Participant Coordination**: Migration waits for all participants to complete preparation before copying data.
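As the Session Events item above notes, migration is surfaced to components through the existing session-event mechanism, so no migration-specific hooks are needed. Below is a minimal sketch of a component deferring work on `SessionLost`, assuming Pulsar's `MetadataStoreExtended.registerSessionListener` API; the class and deferral flag are illustrative.

```java
import org.apache.pulsar.metadata.api.extended.MetadataStoreExtended;
import org.apache.pulsar.metadata.api.extended.SessionEvent;

/** Illustrative consumer of session events; the deferral logic is a placeholder. */
class RolloverDeferralSketch {
    private volatile boolean deferNonCriticalWrites = false;

    RolloverDeferralSketch(MetadataStoreExtended store) {
        // The DualMetadataStore surfaces migration as ordinary session events,
        // so existing components need no migration-specific code.
        store.registerSessionListener(this::onSessionEvent);
    }

    private void onSessionEvent(SessionEvent event) {
        switch (event) {
            case SessionLost:
                deferNonCriticalWrites = true;   // e.g., postpone ledger rollovers
                break;
            case SessionReestablished:
                deferNonCriticalWrites = false;  // resume deferred operations
                break;
            default:
                break;
        }
    }

    boolean shouldDeferRollover() {
        return deferNonCriticalWrites;
    }
}
```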
### Consistency

1. **Atomic Cutover**: All participants switch to the target store simultaneously when the COMPLETED phase is detected.

2. **Ephemeral Session Consistency**: Each participant manages its own ephemeral nodes in the target store with proper session management.

3. **No Dual-Write Complexity**: By blocking writes during migration, we avoid complex dual-write error handling and data divergence issues.

## Configuration

### No Configuration Changes for Migration

The beauty of this design is that **no configuration changes are needed to start migration**:

- Brokers and bookies continue using their existing `metadataStoreUrl` config
- The `DualMetadataStore` wrapper is automatically applied when using ZooKeeper
- The target URL is provided only when triggering migration via the CLI

### Post-Migration Configuration

After migration completes and the validation period ends, update the config files:

```properties
# Before migration
metadataStoreUrl=zk://zk1:2181,zk2:2181,zk3:2181/pulsar

# After migration (update and rolling restart)
metadataStoreUrl=oxia://oxia1:6648
```

## Comparison with Kafka's ZooKeeper → KRaft Migration

Apache Kafka faced a similar challenge migrating from ZooKeeper to KRaft (Kafka Raft). Their approach provides useful comparison points:

### Kafka's Approach (KIP-866)

**Migration Strategy:**
- **Dual-mode operation**: Kafka brokers run in a hybrid mode where the KRaft controller reads from ZooKeeper
- **Metadata synchronization**: The KRaft controller actively mirrors metadata from ZooKeeper to KRaft
- **Phased cutover**: Operators manually transition from ZK_MIGRATION mode to KRAFT mode
- **Write forwarding**: During migration, metadata writes go to ZooKeeper and are replicated to KRaft

**Timeline:**
- Migration can take hours or days as metadata is continuously synchronized
- Requires careful monitoring of lag between ZooKeeper and KRaft
- Rollback is possible until the final KRAFT mode is committed

### Pulsar's Approach (This PIP)

**Migration Strategy:**
- **Transparent wrapper**: DualMetadataStore wraps the existing store without broker code changes
- **Read-only migration**: Metadata writes blocked during migration (< 30 seconds for most clusters)
- **Atomic copy**: All persistent data copied in one operation with version preservation
- **Single source of truth**: No dual-write or dual-read - metadata is always consistent
- **Automatic cutover**: All participants switch simultaneously when the COMPLETED phase is detected

**Timeline:**
- Migration completes in **< 30 seconds** for typical deployments (even up to 500 MB of metadata)
- No lag monitoring needed
- Automatic rollback on failure (FAILED phase)

### Key Differences

| Aspect | Kafka (ZK → KRaft) | Pulsar (ZK → Oxia) |
|--------|--------------------|--------------------|
| **Migration Duration** | Hours to days | **< 30 seconds** (up to 500 MB) |
| **Metadata Writes** | Continue during migration | Blocked during migration |
| **Data Plane** | Unaffected | Unaffected (publish/consume continues) |
| **Approach** | Continuous sync + dual-mode | Atomic copy + read-only mode |
| **Consistency** | Dual-write (eventual consistency) | **Single source of truth (always consistent)** |
| **Complexity** | High (dual-mode broker logic) | Low (transparent wrapper) |
| **Version Preservation** | Not applicable (different metadata models) | Yes (conditional writes preserved) |
| **Rollback** | Manual, complex | Automatic on failure |
| **Monitoring** | Requires lag tracking | Simple phase monitoring |

### Why Pulsar's Approach Differs

1. **Data Plane Independence**: **The key insight is that Pulsar's data plane (publish/consume, ledger writes) does not require metadata writes to function.** This architectural property allows pausing metadata writes for a brief period (< 30 seconds) without affecting data operations. This is what makes the migration **provably safe and consistent**, not the metadata size.

2. **Write-Pause Safety**: Pausing writes during the copy ensures:
   - No dual-write complexity
   - No data divergence between stores
   - No conflict resolution needed
   - Guaranteed consistency

   This works regardless of metadata size - whether 50K nodes or millions of topics. The migration handles large metadata volumes through high concurrency (1000 parallel operations), completing in < 30 seconds even for 500 MB.
3. **Ephemeral Node Handling**: Pulsar has significant ephemeral metadata (broker registrations, bundle ownership), making dual-write complex. Read-only mode simplifies this.

4. **Conditional Writes**: Pulsar relies heavily on compare-and-set operations. Version preservation ensures these continue working post-migration, which Kafka doesn't need to address.

5. **Architectural Enabler**: Pulsar's separation of the data plane and the metadata plane allows brief metadata write pauses without data plane impact, enabling a simpler, safer migration approach.

### Lessons from Kafka's Experience

Pulsar's design incorporates lessons from Kafka's migration:

- ✅ **Avoid dual-write complexity**: Kafka found that dual-mode operation added significant code complexity. Pulsar's read-only approach is simpler **and guarantees consistency**.
- ✅ **Clear phase boundaries**: Kafka's migration has an unclear "completion" point. Pulsar has an explicit COMPLETED phase.
- ✅ **Automatic participant coordination**: Kafka requires manual broker restarts. Pulsar participants coordinate automatically.
- ✅ **Fast migration**: A **< 30 seconds** read-only window is acceptable for most production environments.
- ❌ **Brief write unavailability**: Pulsar accepts brief metadata write unavailability (< 30 sec) vs. Kafka's continuous operation, but gains guaranteed consistency and simplicity.

## References

- [PIP-45: Pluggable metadata interface](https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface)
- [Oxia: A Scalable Metadata Store](https://github.com/streamnative/oxia)
- [MetadataStore interface](https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java)
- [KIP-866: ZooKeeper to KRaft Migration](https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration) - Kafka's approach to metadata store migration

--
Matteo Merli
<[email protected]>
