Hi @dybyte. Thanks for your work! I think we should only keep the part that replaces IMAP data with RocksDB. Checkpoint data works fine today and does not use IMAP, so at this stage we are focusing on replacing IMAP only.

Regarding the data partitioning issue, I think we can drop the idea of storing data in different locations and simply store all of it on the master node. Other nodes can retrieve data by sending RPC requests to the master. RocksDB can store its files on S3, so when the master node changes, all data can still be retrieved from S3 without us having to manage data consistency across nodes ourselves.
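To make the idea concrete, here is a rough sketch assuming RocksDB's Java Checkpoint API and the AWS SDK v2 for S3. MasterStateStore and uploadCheckpoint are placeholder names rather than existing SeaTunnel classes, and the restore counterpart is sketched after the quoted design below:

    // Sketch: a single RocksDB instance owned by the master, checkpointed to S3.
    import org.rocksdb.Checkpoint;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class MasterStateStore implements AutoCloseable {

        private final RocksDB db;
        private final S3Client s3;
        private final String bucket;

        public MasterStateStore(String localDbPath, String bucket) throws RocksDBException {
            RocksDB.loadLibrary();
            this.db = RocksDB.open(new Options().setCreateIfMissing(true), localDbPath);
            this.s3 = S3Client.create();
            this.bucket = bucket;
        }

        // Called locally on the master; other nodes reach these via an RPC handler (not shown).
        public void put(byte[] key, byte[] value) throws RocksDBException {
            db.put(key, value);
        }

        public byte[] get(byte[] key) throws RocksDBException {
            return db.get(key);
        }

        // Create a consistent RocksDB checkpoint and push its files to S3 so a newly
        // elected master can rebuild the same state. checkpointDir must not exist yet;
        // RocksDB creates it. A real version would add a checkpoint id to the S3 keys.
        public void uploadCheckpoint(Path checkpointDir) throws RocksDBException, IOException {
            try (Checkpoint cp = Checkpoint.create(db)) {
                cp.createCheckpoint(checkpointDir.toString());
            }
            try (Stream<Path> files = Files.walk(checkpointDir)) {
                files.filter(Files::isRegularFile).forEach(f ->
                        s3.putObject(
                                PutObjectRequest.builder()
                                        .bucket(bucket)
                                        .key("state/" + checkpointDir.relativize(f))
                                        .build(),
                                RequestBody.fromFile(f)));
            }
        }

        @Override
        public void close() {
            db.close();
            s3.close();
        }
    }

The put/get methods are only ever invoked on the master (locally or through its RPC handler), so the RocksDB instance keeps a single writer.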
On Fri, Oct 3, 2025 at 20:13, Doyeon Kim <[email protected]> wrote:

> SeaTunnel RocksDB-based State Backend V2 Design Doc
> ------------------------------
> 1. Partition Rebalancing & Data Migration
>
> Motivation and Problem Definition
>
>    - When cluster topology changes (e.g., node join/leave/failure),
>    partition (Shard/KeyGroup) ownership must be reassigned.
>    - Partition data stored on the previous owner node must be
>    automatically migrated to the new owner node.
>    - Without this, partition migration can cause data inaccessibility or
>    temporary data loss.
>
> Design Approach (Reflected in Architecture/Diagrams)
>
>    - *External Storage-Based Partition Migration*
>       - Each partition's RocksDB instance periodically uploads a snapshot
>       to external storage (HDFS, local FS, etc.).
>       - When a partition ownership transfer is decided (by
>       PartitionService), the current owner node creates and uploads the
>       latest snapshot.
>       - The new owner node downloads and restores the partition snapshot
>       from external storage.
>       - Data movement is "file-based" for simplicity and to avoid
>       lock/network concurrency issues.
>    - *PartitionService Event Trigger*
>       - PartitionService detects and broadcasts partition ownership
>       change events (leveraging Hazelcast's MembershipListener,
>       MigrationListener, and PartitionLostListener); an illustrative
>       sketch follows below.
>       - Write/read operations for the affected partition are restricted
>       until the ownership change and restore are complete (to ensure
>       consistency).
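>
> A minimal sketch of how this trigger could be wired, assuming the
> Hazelcast 5.x listener interfaces. SnapshotManager and its methods are
> hypothetical placeholders for the snapshot upload/restore logic above,
> not existing SeaTunnel or Hazelcast APIs:
>
>     // Reacts to Hazelcast partition migration events and drives snapshot upload/restore.
>     import com.hazelcast.core.HazelcastInstance;
>     import com.hazelcast.partition.MigrationListener;
>     import com.hazelcast.partition.MigrationState;
>     import com.hazelcast.partition.ReplicaMigrationEvent;
>
>     public class RocksDBMigrationTrigger implements MigrationListener {
>
>         // Hypothetical stand-in for the snapshot upload/restore component.
>         public interface SnapshotManager {
>             void freezeWrites();
>             void unfreezeWrites();
>             void downloadAndRestore(int partitionId);
>             void markPartitionSuspect(int partitionId);
>         }
>
>         private final HazelcastInstance hazelcast;
>         private final SnapshotManager snapshots;
>
>         public RocksDBMigrationTrigger(HazelcastInstance hazelcast, SnapshotManager snapshots) {
>             this.hazelcast = hazelcast;
>             this.snapshots = snapshots;
>         }
>
>         public void register() {
>             hazelcast.getPartitionService().addMigrationListener(this);
>         }
>
>         @Override
>         public void migrationStarted(MigrationState state) {
>             // Restrict writes while partitions are being moved.
>             snapshots.freezeWrites();
>         }
>
>         @Override
>         public void migrationFinished(MigrationState state) {
>             snapshots.unfreezeWrites();
>         }
>
>         @Override
>         public void replicaMigrationCompleted(ReplicaMigrationEvent event) {
>             // Only react when primary ownership (replica index 0) lands on this member.
>             if (event.getReplicaIndex() == 0
>                     && hazelcast.getCluster().getLocalMember().equals(event.getDestination())) {
>                 snapshots.downloadAndRestore(event.getPartitionId());
>             }
>         }
>
>         @Override
>         public void replicaMigrationFailed(ReplicaMigrationEvent event) {
>             snapshots.markPartitionSuspect(event.getPartitionId());
>         }
>     }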
>
> Open Issues & Discussion Points
>
>    - One open point is how to ensure partition data consistency during
>    migration (e.g., blocking or synchronizing writes before the snapshot).
>    Do you think there could be a better alternative?
>
> ------------------------------
> 2. Replication & High Availability (HA)
>
> Motivation and Problem Definition
>
>    - Data loss must be prevented if a single node fails.
>    - "Always recoverable" data is required for production environments.
>
> Design Approach (Reflected in Diagrams)
>
>    - *Periodic External Storage Snapshots for HA*
>       - Each partition's RocksDB instance saves a snapshot to external
>       storage (HDFS, local FS, etc.) at a configured interval.
>       - On node failure, the affected partition is reassigned to a new
>       node, which downloads and restores the latest snapshot.
>       - This minimizes service downtime and data loss (within the
>       snapshot interval).
>    - *Real-Time Replication (Primary/Replica)*
>       - For large-scale RocksDB state or ETL pipelines, real-time
>       replication brings significant complexity and performance overhead,
>       and is often less effective operationally.
>       - The default is periodic snapshots; real-time replication may be
>       considered only if explicitly required.
>
> Open Issues & Discussion Points
>
>    - Assessing the need for real-time replication (Primary/Replica);
>    periodic snapshots are the default.
>    - Balancing snapshot interval, performance, and storage cost:
>       - Shorter intervals → less potential data loss, higher cost
>       - Longer intervals → more potential data loss, lower cost
>
> ------------------------------
> Conclusion & Next Steps
>
>    - This design adopts the *periodic external storage snapshot +
>    recovery* pattern as the foundation for partition migration, HA, and
>    restore.
>    - Real-time replication (Primary/Replica) will only be considered if
>    operational requirements clearly justify it.
>    - Integration with external storage will leverage SeaTunnel's built-in
>    abstractions and logic wherever possible to minimize implementation
>    complexity and operational overhead.
>    - Further enhancements and optimizations will be addressed
>    incrementally as real-world operational needs arise.
>
> ------------------------------
> [Reference] Main Components in Current Architecture Diagram
>
>    - *ValueState* (key-value state storage, IMAP replacement)
>    - *RocksDBStateBackend* (manages ValueState and other states, instance
>    creation/lookup)
>    - *RocksDBPartitionService* (controls partition ownership, distributed
>    lookup, migration)
>    - *CheckpointStorage/Factory* (handles external storage integration,
>    snapshot management)
>    - *CheckpointCoordinator* (orchestrates snapshots)
>
> → The architecture is designed to be extensible and compatible with
> future state types (e.g., MapState), new storage backends, or
> distributed engines.
>
> https://github.com/apache/seatunnel/issues/9851 (Diagram is attached in the issue comment.)
>
>    - *Note:* The actual implementation may differ in some details as
>    development proceeds; the design will be refined iteratively based on
>    practical feedback and technical findings.
>
> Best regards,
>
> Doyeon
> (GitHub ID: dybyte)
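For completeness, the restore side of my suggestion above: when the master changes, the new master pulls the checkpoint files back from S3 and reopens RocksDB on top of them. The sketch below reuses the hypothetical "state/" key prefix from the upload sketch and leaves out S3 listing pagination and checkpoint versioning to stay short:

    // Sketch: rebuild the master's RocksDB from the checkpoint files stored in S3.
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;
    import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
    import software.amazon.awssdk.services.s3.model.S3Object;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class MasterStateRestore {

        public static RocksDB restoreLatest(S3Client s3, String bucket, Path localDir)
                throws IOException, RocksDBException {
            Files.createDirectories(localDir);
            // Download every object under the checkpoint prefix into a fresh local directory.
            for (S3Object obj : s3.listObjectsV2(
                        ListObjectsV2Request.builder().bucket(bucket).prefix("state/").build())
                    .contents()) {
                Path target = localDir.resolve(obj.key().substring("state/".length()));
                Files.createDirectories(target.getParent());
                s3.getObject(
                        GetObjectRequest.builder().bucket(bucket).key(obj.key()).build(),
                        target);
            }
            // A RocksDB checkpoint directory is a complete database, so it can be opened directly.
            RocksDB.loadLibrary();
            return RocksDB.open(localDir.toString());
        }
    }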
