Thank you for the feedback!
I was wondering if you have any thoughts on how the RocksDB storage might
be extended or evolved in the future.
I’d like to take that into account while working on the implementation.

On Sun, Oct 19, 2025 at 4:58 PM Jia Fan <[email protected]> wrote:

> Hi @dybyte. Thanks for your work! I think we should only keep the part that
> replaces IMAP data with RocksDB. The checkpoint data works fine now, so at
> this stage we are focusing on replacing IMAP; checkpoint data does not use
> IMAP.
>
> Regarding the data partitioning issue, I think we can abandon storing
> data in different locations and simply store all data on the master
> node. Other nodes can retrieve data by sending RPC requests to the
> master. RocksDB can store files on S3. This way, when the master node
> changes, all data can still be retrieved from S3 without having to
> maintain data consistency ourselves.
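>
> A rough sketch of that master-only layout (the names StateRpcGateway and
> MemberStateGateway below are illustrative, not existing SeaTunnel classes):
>
> // Every member exposes the same key/value API; only the master touches
> // RocksDB, and all other members forward the call to the master over RPC.
> interface StateRpcGateway {
>     byte[] get(String namespace, byte[] key) throws Exception;
>     void put(String namespace, byte[] key, byte[] value) throws Exception;
> }
>
> public class MemberStateGateway implements StateRpcGateway {
>     private final boolean isMaster;
>     private final StateRpcGateway localRocksDbStore; // populated only on the master
>     private final StateRpcGateway masterRpcClient;   // RPC to the current master member
>
>     public MemberStateGateway(boolean isMaster,
>                               StateRpcGateway localRocksDbStore,
>                               StateRpcGateway masterRpcClient) {
>         this.isMaster = isMaster;
>         this.localRocksDbStore = localRocksDbStore;
>         this.masterRpcClient = masterRpcClient;
>     }
>
>     @Override
>     public byte[] get(String namespace, byte[] key) throws Exception {
>         return isMaster ? localRocksDbStore.get(namespace, key)
>                         : masterRpcClient.get(namespace, key);
>     }
>
>     @Override
>     public void put(String namespace, byte[] key, byte[] value) throws Exception {
>         if (isMaster) {
>             localRocksDbStore.put(namespace, key, value);
>         } else {
>             masterRpcClient.put(namespace, key, value);
>         }
>     }
> }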
>
> Doyeon Kim <[email protected]> wrote on Friday, October 3, 2025 at 20:13:
> >
> > Seatunnel RocksDB-based State Backend V2 Design Doc
> > ------------------------------
> > 1. Partition Rebalancing & Data Migration
> >
> > Motivation and Problem Definition
> >
> >    - When cluster topology changes (e.g., node join/leave/failure),
> >    partition (Shard/KeyGroup) ownership must be reassigned.
> >    - Partition data stored on the previous owner node must be
> >    automatically migrated to the new owner node.
> >    - Without automatic migration, partition data can become inaccessible
> >    or be temporarily lost while ownership changes.
> >
> > Design Approach (Reflected in Architecture/Diagrams)
> >
> >    - *External Storage-Based Partition Migration*
> >       - Each partition’s RocksDB instance periodically uploads a snapshot
> >       to external storage (HDFS, Local FS etc).
> >       - When partition ownership transfer is decided (by
> >       PartitionService), the current owner node creates and uploads the
> >       latest snapshot.
> >       - The new owner node downloads and restores the partition snapshot
> >       from external storage.
> >       - Data movement is “file-based” for simplicity and to avoid
> >       lock/network concurrency issues.
> >    - *PartitionService Event Trigger*
> >       - PartitionService detects and broadcasts partition ownership
> >       change events (leveraging Hazelcast's MembershipListener,
> >       MigrationListener, and PartitionLostListener).
> >       - Write/read operations for the affected partition are restricted
> >       until ownership change and restore are complete (to ensure
> >       consistency).
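> >
> > A minimal sketch of this listener wiring, assuming Hazelcast 5.x listener
> > interfaces (OwnershipCoordinator is a hypothetical stand-in for the
> > pause / snapshot-upload / restore logic described above):
> >
> > import com.hazelcast.core.HazelcastInstance;
> > import com.hazelcast.partition.MigrationListener;
> > import com.hazelcast.partition.MigrationState;
> > import com.hazelcast.partition.ReplicaMigrationEvent;
> >
> > // Hypothetical component standing in for pause / snapshot / restore logic.
> > interface OwnershipCoordinator {
> >     void blockWrites();
> >     void restorePartitionIfOwned(int partitionId);
> >     void unblockWrites();
> > }
> >
> > public class PartitionEventTrigger implements MigrationListener {
> >     private final OwnershipCoordinator coordinator;
> >
> >     public PartitionEventTrigger(HazelcastInstance hz,
> >                                  OwnershipCoordinator coordinator) {
> >         this.coordinator = coordinator;
> >         // Ownership-change events arrive through Hazelcast's partition service.
> >         hz.getPartitionService().addMigrationListener(this);
> >     }
> >
> >     @Override
> >     public void migrationStarted(MigrationState state) {
> >         coordinator.blockWrites();   // restrict writes while ownership is in flux
> >     }
> >
> >     @Override
> >     public void replicaMigrationCompleted(ReplicaMigrationEvent event) {
> >         // Per-partition signal: the new owner can download and restore the snapshot.
> >         coordinator.restorePartitionIfOwned(event.getPartitionId());
> >     }
> >
> >     @Override
> >     public void replicaMigrationFailed(ReplicaMigrationEvent event) { }
> >
> >     @Override
> >     public void migrationFinished(MigrationState state) {
> >         coordinator.unblockWrites(); // writes resume once the migration round ends
> >     }
> > }
> >
> > MembershipListener and PartitionLostListener would be registered in the same
> > way (via hz.getCluster() and hz.getPartitionService()) for node join/leave
> > and lost-partition handling.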
> >
> > Open Issues & Discussion Points
> >
> >    - One open point is how to ensure partition data consistency during
> >    migration (e.g., blocking or synchronizing writes before the snapshot).
> >    Do you think there could be a better alternative?
> >
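> > One option, sketched under the assumption that a short exclusive section is
> > enough to establish a clean cut-off point before the snapshot (illustrative
> > only, not a settled part of the design):
> >
> > import java.util.concurrent.locks.ReentrantReadWriteLock;
> >
> > // Regular puts take the shared lock; the snapshot takes the exclusive lock,
> > // so writes are blocked only while the snapshot is being started.
> > public class SnapshotWriteGuard {
> >     private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
> >
> >     public void withWrite(Runnable write) {
> >         lock.readLock().lock();      // many concurrent writers share this lock
> >         try {
> >             write.run();
> >         } finally {
> >             lock.readLock().unlock();
> >         }
> >     }
> >
> >     public void withSnapshot(Runnable snapshot) {
> >         lock.writeLock().lock();     // exclusive: no writes during the snapshot
> >         try {
> >             snapshot.run();
> >         } finally {
> >             lock.writeLock().unlock();
> >         }
> >     }
> > }
> >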
> > ------------------------------
> > 2. Replication & High Availability (HA)
> >
> > Motivation and Problem Definition
> >
> >    - Data loss must be prevented if a single node fails.
> >    - “Always recoverable” data is required for production environments.
> >
> > Design Approach (Reflected in Diagrams)
> >
> >    - *Periodic External Storage Snapshots for HA*
> >       - Each partition’s RocksDB instance saves a snapshot to external
> >       storage (HDFS, Local, etc) at a configured interval.
> >       - On node failure, the affected partition is reassigned to a new
> >       node, which downloads and restores the latest snapshot.
> >       - This minimizes service downtime and data loss (bounded by the
> >       snapshot interval); see the snapshot sketch after this list.
> >    - *Real-Time Replication (Primary/Replica)*
> >       - For large-scale RocksDB state or ETL pipelines, real-time
> >       replication brings significant complexity and performance overhead,
> >       and is often less effective operationally.
> >       - The default is periodic snapshots; real-time replication may be
> >       considered only if explicitly required.
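> >
> > A minimal sketch of the periodic snapshot step from the first bullet,
> > assuming the RocksDB Java Checkpoint API (SnapshotUploader and the
> > scheduling wrapper are hypothetical placeholders for the HDFS/local/S3
> > integration, not existing SeaTunnel classes):
> >
> > import java.util.concurrent.Executors;
> > import java.util.concurrent.ScheduledExecutorService;
> > import java.util.concurrent.TimeUnit;
> > import org.rocksdb.Checkpoint;
> > import org.rocksdb.RocksDB;
> >
> > // Hypothetical uploader over the external storage (HDFS, local FS, S3, ...).
> > interface SnapshotUploader {
> >     void upload(int partitionId, String localCheckpointDir);
> > }
> >
> > public class PeriodicSnapshotter {
> >     private final ScheduledExecutorService scheduler =
> >             Executors.newSingleThreadScheduledExecutor();
> >     private final RocksDB db;
> >     private final SnapshotUploader uploader;
> >
> >     public PeriodicSnapshotter(RocksDB db, SnapshotUploader uploader) {
> >         this.db = db;
> >         this.uploader = uploader;
> >     }
> >
> >     public void start(int partitionId, String localBaseDir, long intervalSeconds) {
> >         scheduler.scheduleAtFixedRate(() -> {
> >             try {
> >                 // Hard-link based, consistent checkpoint of the live instance.
> >                 String checkpointDir = localBaseDir + "/" + System.currentTimeMillis();
> >                 try (Checkpoint checkpoint = Checkpoint.create(db)) {
> >                     checkpoint.createCheckpoint(checkpointDir);
> >                 }
> >                 uploader.upload(partitionId, checkpointDir);
> >             } catch (Exception e) {
> >                 // A failed snapshot only widens the recovery window; retry next tick.
> >             }
> >         }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
> >     }
> > }
> >
> > The restore path on the new owner would then be to download the latest
> > checkpoint directory and simply reopen RocksDB on it.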
> >
> > Open Issues & Discussion Points
> >
> >    - Assessing the need for real-time replication/Primary-Replica
> >    (periodic snapshots are the default)
> >    - Balancing snapshot interval, performance, and storage cost:
> >       - Shorter intervals → less potential data loss, higher cost
> >       - Longer intervals → more potential data loss, lower cost
> >
> > ------------------------------
> > Conclusion & Next Steps
> >
> >    - This design adopts the *periodic external storage snapshot +
> >    recovery* pattern as the foundation for partition migration, HA, and
> >    restore.
> >    - Real-time replication/Primary-Replica will only be considered if
> >    operational requirements clearly justify it.
> >    - Integration with external storage will leverage Seatunnel’s built-in
> >    abstractions and logic wherever possible to minimize implementation
> >    complexity and operational overhead.
> >    - Further enhancements and optimizations will be addressed
> >    incrementally as real-world operational needs arise.
> >
> > ------------------------------
> > [Reference] Main Components in Current Architecture Diagram
> >
> >    - *ValueState* (Key-Value state storage, IMAP replacement)
> >    - *RocksDBStateBackend* (Manages ValueState and other states, instance
> >    creation/lookup)
> >    - *RocksDBPartitionService* (Controls partition ownership, distributed
> >    lookup, migration)
> >    - *CheckpointStorage/Factory* (Handles external storage integration,
> >    snapshot management)
> >    - *CheckpointCoordinator* (Orchestrates snapshots)
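> >
> > As a rough illustration of how ValueState could sit directly on RocksDB
> > get/put (the ValueState and StateSerializer shapes below are assumptions
> > for the sketch, not the final SeaTunnel interfaces):
> >
> > import org.rocksdb.RocksDB;
> > import org.rocksdb.RocksDBException;
> >
> > // Assumed serializer abstraction for the sketch.
> > interface StateSerializer<T> {
> >     byte[] serialize(T value);
> >     T deserialize(byte[] bytes);
> > }
> >
> > // Assumed minimal ValueState contract.
> > interface ValueState<T> {
> >     T value() throws Exception;
> >     void update(T value) throws Exception;
> > }
> >
> > public class RocksDbValueState<T> implements ValueState<T> {
> >     private final RocksDB db;
> >     private final byte[] key;                    // serialized namespace + user key
> >     private final StateSerializer<T> serializer;
> >
> >     public RocksDbValueState(RocksDB db, byte[] key, StateSerializer<T> serializer) {
> >         this.db = db;
> >         this.key = key;
> >         this.serializer = serializer;
> >     }
> >
> >     @Override
> >     public T value() throws RocksDBException {
> >         byte[] bytes = db.get(key);              // point lookup replacing the IMAP read
> >         return bytes == null ? null : serializer.deserialize(bytes);
> >     }
> >
> >     @Override
> >     public void update(T value) throws RocksDBException {
> >         db.put(key, serializer.serialize(value)); // replaces the IMAP write
> >     }
> > }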
> >
> > → The architecture is designed to be extensible and compatible with
> > future State types (e.g., MapState), new storage backends, or distributed
> > engines.
> >
> > https://github.com/apache/seatunnel/issues/9851 (Diagram is attached in
> > the issue comment.)
> >
> >    - *Note:* The actual implementation may differ in some details as
> >    development proceeds; the design will be refined iteratively based on
> >    practical feedback and technical findings.
> >
> >
> > Best regards,
> >
> > Doyeon
> >
> > (Github ID : dybyte)
>
