[
https://issues.apache.org/jira/browse/HDDS-12307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932140#comment-17932140
]
Ritesh Shukla commented on HDDS-12307:
--------------------------------------
Excellent proposal! cc [~prashantpogde]
> Realtime Cross-Region Bucket Replication
> ----------------------------------------
>
> Key: HDDS-12307
> URL: https://issues.apache.org/jira/browse/HDDS-12307
> Project: Apache Ozone
> Issue Type: New Feature
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> Currently, there are a few cross-region (geo-replicated) DR solutions for
> Ozone buckets:
> * Run a periodic distcp from the source bucket to the target bucket
> * Take a snapshot of the bucket and send it to the remote DR site
> ([https://ozone.apache.org/docs/edge/feature/snapshot.html])
> There are pros and cons to the current approaches:
> * Pros
> ** It is simpler: setting up periodic jobs can be done quite easily (e.g.
> using cron jobs)
> ** The distcp implementation sets up a MapReduce job that parallelizes the
> copy from the source cluster to the target cluster
> ** No additional Ozone components needed
> * Cons
> ** It is not “realtime”: the freshness of the data depends on how frequently
> and how fast distcp runs
> ** It incurs significant overhead on the source cluster: it requires scanning
> all the files in the source cluster (and possibly in the target cluster as
> well)
> Cloudera Replication Manager
> ([https://docs.cloudera.com/replication-manager/1.5.4/replication-policies/topics/rm-pvce-understand-ozone-replication-policy.html])
> adds incremental replication support after the initial bootstrap step by
> taking the snapdiff between two snapshots. This is better since there is no
> need to list all the keys under the bucket again, but it is still not
> technically realtime.
> This ticket tracks possible solutions for realtime asynchronous bucket
> replication between two clusters in different regions (with 100+ ms latency).
> The current idea is to have a CDC (Change Data Capture) framework on the OM
> that feeds a Replication subsystem: the Replication Queue receives the delta
> from the CDC, and the Syncer (Replicator) processes the delta from the queue
> and replicates it to the target cluster.
> * The CDC component can subscribe to and receive any updates for the buckets
> through gRPC bidirectional streaming APIs
> ** The “change data” can be Raft logs, RocksDB WAL logs, or OM audit logs
> * The Replication Queue can be something like a Kafka topic, Ratis LogService
> (https://issues.apache.org/jira/browse/RATIS-271), or a local persistent
> queue (e.g. Chronicle Queue)
> * Replication subsystem can use a master-worker pattern
> ** The master will consume replication tasks from the replication queue and
> assign workers to run the cross-region replication tasks
> *** If a worker fails, the master will reassign its replication tasks to
> another worker
> ** Each worker will run the replication tasks assigned by the master
> ** The master can be an OM service (e.g. source cluster OM service)
> * The Replication subsystem can read from an OM follower / listener so as
> not to affect the OM leader
> ** An OM listener would be better since it will never be promoted to leader,
> while an OM follower might, which would require additional logic to handle
> the follower's role change.
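The queue-plus-master/worker flow described above can be sketched roughly as follows. This is a minimal single-process sketch, not Ozone code: the `Delta` record, the in-memory `BlockingQueue` standing in for Kafka / Chronicle Queue, and the re-enqueue-on-failure "reassignment" are all assumptions for illustration.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical CDC delta entry: one key mutation captured from the source bucket.
record Delta(String bucket, String key, String op) {}

public class ReplicationMaster {
    // In-memory stand-in for the replication queue (Kafka topic / Chronicle Queue).
    private final BlockingQueue<Delta> queue = new LinkedBlockingQueue<>();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final AtomicInteger replicated = new AtomicInteger();

    public void submit(Delta d) { queue.add(d); }

    // Master loop: drain the queue and hand each delta to a worker.
    // If a worker fails, the delta goes back onto the queue so another
    // worker picks it up (the "reassign on failure" behavior above).
    public void drain(int expected) throws InterruptedException {
        while (replicated.get() < expected) {
            Delta d = queue.poll(100, TimeUnit.MILLISECONDS);
            if (d == null) continue;
            workers.submit(() -> {
                try {
                    applyToTarget(d);          // hypothetical cross-region write
                    replicated.incrementAndGet();
                } catch (Exception e) {
                    queue.add(d);              // reassignment: back onto the queue
                }
            });
        }
        workers.shutdown();
    }

    // Placeholder for the actual write to the destination bucket
    // (e.g. a put/delete through a target-cluster OM client).
    private void applyToTarget(Delta d) { }

    public int replicatedCount() { return replicated.get(); }

    public static void main(String[] args) throws Exception {
        ReplicationMaster master = new ReplicationMaster();
        List.of(new Delta("b1", "k1", "PUT"),
                new Delta("b1", "k2", "PUT"),
                new Delta("b1", "k3", "DELETE")).forEach(master::submit);
        master.drain(3);
        System.out.println("replicated=" + master.replicatedCount());
    }
}
```

In a real deployment the master and workers would be separate processes and the queue durable, but the control flow (consume, assign, re-enqueue on worker failure) would be the same shape.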
> There are two major steps when setting up bucket replication:
> # Initial batch replication of the source bucket (bootstrap)
> ** Backfill the newly created target bucket with the existing objects from
> the source bucket
> ** The replication subsystem will take a bucket snapshot, divide the keys
> into partitions, and send them to the workers for replication
> # Asynchronous live bucket replication
> ** Tail the subsequent changes (CDC) of the source bucket and replicate them
> to the destination bucket
> ** There are two phases:
> *** Incremental snapshot: similar to Cloudera Replication Manager, the
> replication subsystem takes periodic snapshots, computes the snapdiff between
> the previous and current bucket snapshots, and replicates the changes to the
> target bucket.
> **** This would be the baseline implementation.
> *** Near-realtime bucket replication: the main idea is that each Syncer
> (Replicator) creates a persistent gRPC bidirectional streaming channel (over
> HTTP/2) with one of the OM nodes, which sends it, through a Log Tailer, the
> log entries related to keys in the specific bucket. The Syncer then persists
> the log entries to its internal persistent queue, which is consumed by a
> worker pool to replicate the data to the destination bucket.
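The incremental-snapshot phase boils down to "diff two snapshots, replay the diff on the target". A toy sketch, with plain maps standing in for bucket snapshots; the diff shape and method names here are illustrative assumptions, not the Ozone snapdiff API.

```java
import java.util.*;

public class SnapDiffSketch {
    // Compute the delta between two snapshots of a bucket (key -> content).
    // Keys new in `current` or with changed content become PUTs; keys present
    // only in `previous` become DELETEs; unchanged keys are skipped entirely,
    // which is why this beats re-listing the whole bucket.
    static Map<String, String> diff(Map<String, String> previous,
                                    Map<String, String> current) {
        Map<String, String> ops = new TreeMap<>();
        current.forEach((k, v) -> {
            if (!v.equals(previous.get(k))) ops.put(k, "PUT");
        });
        previous.keySet().forEach(k -> {
            if (!current.containsKey(k)) ops.put(k, "DELETE");
        });
        return ops;
    }

    // Replay the diff onto the target bucket's state.
    static void apply(Map<String, String> target,
                      Map<String, String> ops,
                      Map<String, String> current) {
        ops.forEach((k, op) -> {
            if (op.equals("DELETE")) target.remove(k);
            else target.put(k, current.get(k));
        });
    }

    public static void main(String[] args) {
        Map<String, String> prev = Map.of("a", "1", "b", "2");
        Map<String, String> curr = Map.of("a", "1", "b", "3", "c", "4");
        Map<String, String> ops = diff(prev, curr);
        System.out.println(ops);                 // {b=PUT, c=PUT} — "a" is unchanged
        Map<String, String> target = new HashMap<>(prev);
        apply(target, ops, curr);
        System.out.println(target.equals(curr)); // true: target has caught up
    }
}
```

Running this diff-and-apply loop on a period gives the baseline behavior; the near-realtime phase replaces the periodic diff with the streamed log entries but keeps the same apply step on the target side.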
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]