[ 
https://issues.apache.org/jira/browse/HDDS-12307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932140#comment-17932140
 ] 

Ritesh Shukla commented on HDDS-12307:
--------------------------------------

Excellent proposal! cc [~prashantpogde] 

> Realtime Cross-Region Bucket Replication
> ----------------------------------------
>
>                 Key: HDDS-12307
>                 URL: https://issues.apache.org/jira/browse/HDDS-12307
>             Project: Apache Ozone
>          Issue Type: New Feature
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> Currently, there are a few cross-region (geo-replicated) DR solutions for 
> Ozone buckets:
>  * Run a periodic distcp from the source bucket to the target bucket
>  * Take a snapshot of the bucket and send it to the remote DR site 
> ([https://ozone.apache.org/docs/edge/feature/snapshot.html])
> The current approaches have pros and cons:
>  * Pros
>  ** It is simpler: Setting up periodic jobs can be done quite easily (e.g. 
> using cronjobs)
>  ** The distcp implementation sets up MapReduce jobs that parallelize 
> the copy from the source cluster to the target cluster
>  ** No additional Ozone components needed
>  * Cons
>  ** It is not “realtime”: The freshness of the data depends on how 
> frequently and how fast the distcp runs
>  ** It incurs significant overhead on the source cluster: It requires 
> scanning all the files in the source cluster (and possibly in the target 
> cluster)
> Cloudera Replication Manager 
> ([https://docs.cloudera.com/replication-manager/1.5.4/replication-policies/topics/rm-pvce-understand-ozone-replication-policy.html])
>  adds incremental replication support after the initial bootstrap step by 
> taking the snapdiff between two snapshots. This is better since there is no 
> need to list all the keys under the bucket again, but it's not technically 
> realtime.
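
To make the snapdiff idea concrete, here is a minimal sketch of incremental replication from two snapshots. It is not the Ozone snapdiff API: the `Snapshot` model (key name mapped to a version string), `snap_diff`, and `apply_diff` are hypothetical names used only for illustration, and renames are not modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a snapshot maps key name -> content version (e.g. an etag).
Snapshot = Dict[str, str]


def snap_diff(prev: Snapshot, curr: Snapshot) -> List[Tuple[str, str]]:
    """Return (operation, key) pairs that turn `prev` into `curr`."""
    changes: List[Tuple[str, str]] = []
    for key in sorted(curr):
        if key not in prev:
            changes.append(("CREATE", key))
        elif prev[key] != curr[key]:
            changes.append(("MODIFY", key))
    for key in sorted(prev):
        if key not in curr:
            changes.append(("DELETE", key))
    return changes


def apply_diff(target: Snapshot, source: Snapshot,
               changes: List[Tuple[str, str]]) -> None:
    """Replay only the changed keys on the target; untouched keys are not read."""
    for op, key in changes:
        if op == "DELETE":
            target.pop(key, None)
        else:
            # CREATE and MODIFY both copy the current version from the source.
            target[key] = source[key]
```

Only the keys named in the diff are copied or deleted, which is why this avoids listing the whole bucket again. The freshness of the target is still bounded by how often snapshots are taken.
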
> This ticket is to track possible solutions for realtime asynchronous bucket 
> replication between two clusters in different regions (with 100+ms latency). 
> The current idea is to have a CDC (Change Data Capture) framework on the OM 
> whose output will be sent to a Replication subsystem (a Replication Queue 
> will receive the deltas from the CDC, and a Syncer (Replicator) will process 
> the deltas in the queue and replicate them to the target cluster).
>  * The CDC component can subscribe and receive any updates for the buckets 
> through the gRPC bidirectional streaming APIs
>  ** The choice of the "change data" can be either Raft logs / RocksDB WAL 
> logs / OM Audit logs
>  * Replication Queue can be something like Kafka topic / Ratis logservice 
> (https://issues.apache.org/jira/browse/RATIS-271) / local persistent queue 
> (e.g. Chronicle Queue)
>  * Replication subsystem can use a master-worker pattern
>  ** The master will consume the replications from the replication queue and 
> assign a worker to run the cross-region replication tasks
>  *** If any worker fails, the master will reassign the replication task to 
> another worker
>  ** The worker will run the replication task assigned from the master
>  ** The master can be an OM service (e.g. source cluster OM service)
>  * Replication subsystem can read from OM follower / listener to not affect 
> the OM leader
>  ** OM listener would be better since it will never be promoted to a leader, 
> while an OM follower might be, which might require additional logic to 
> handle the role change.
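
The master-worker pattern in the list above could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Master` class, the in-memory `deque` standing in for the replication queue, and workers modeled as plain functions that return `True` on success. The proposal itself does not fix any of these details.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

# Hypothetical worker signature: takes a task id, returns True on success.
Worker = Callable[[str], bool]


class Master:
    def __init__(self, workers: Dict[str, Worker]) -> None:
        self.workers = workers
        self.queue: Deque[str] = deque()   # stand-in for the replication queue
        self.done: Dict[str, str] = {}     # task -> worker that completed it

    def submit(self, task: str) -> None:
        self.queue.append(task)

    def run(self) -> List[str]:
        """Drain the queue; return the tasks that no worker could complete."""
        failed: List[str] = []
        while self.queue:
            task = self.queue.popleft()
            for name, worker in self.workers.items():
                if worker(task):
                    self.done[task] = name
                    break
                # Worker failed: fall through and reassign to the next worker.
            else:
                failed.append(task)
        return failed
```

A real master would also need to detect workers that hang rather than fail fast, and to persist the assignment state so that it survives a master restart.
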
> There are two major steps when setting up the bucket replication
>  # Initial batch replication of source bucket (bootstrap)
>  ** Backfill the newly created bucket with the existing objects from the 
> source bucket
>  ** The replication subsystem will take the bucket snapshot, divide the keys 
> into partitions and send them to the workers for replication
>  # Asynchronous bucket live replication
>  ** Tailing the subsequent changes (CDC) of the source bucket and replicating 
> them to the destination bucket
>  ** There are two phases
>  *** Incremental snapshot: Similar to the Cloudera Replication Manager, the 
> replication subsystem takes periodic snapshots, computes the snapdiff between 
> the previous bucket snapshot and the current bucket snapshot, and replicates 
> the changes to the target bucket.
>  **** This would be the baseline implementation.
>  *** Near-realtime bucket replication: The main idea is that each Syncer 
> (Replicator) will create a persistent gRPC bidirectional streaming channel 
> (using HTTP/2) with one of the OM nodes, which will send the log entries 
> related to keys of the specific bucket through a Log Tailer. The Syncer will 
> then persist the log entries to its internal persistent queue, which will be 
> consumed by a worker pool that replicates the data to the destination 
> bucket.
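
As a rough sketch of the near-realtime phase, the code below filters a stream of log entries by bucket, queues them, and lets a small worker pool apply them to the target. All names (`LogEntry`, `Syncer`, `tail`, `close`) are hypothetical. The gRPC stream is replaced by a plain iterable, the persistent queue by in-memory `queue.Queue` objects, and the destination bucket by a dict. Routing entries to a worker by a hash of the key is an assumption added here so that changes to the same key are applied in log order; the proposal does not specify an ordering scheme.

```python
import queue
import threading
import zlib
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LogEntry:
    bucket: str
    key: str
    op: str                       # "PUT" or "DELETE"
    data: Optional[bytes] = None


class Syncer:
    def __init__(self, bucket: str, target: Dict[str, bytes],
                 workers: int = 2) -> None:
        self.bucket = bucket
        self.target = target              # stand-in for the destination bucket
        self.lock = threading.Lock()
        # One in-memory queue per worker (stand-in for the persistent queue).
        self.queues = [queue.Queue() for _ in range(workers)]
        self.threads = [threading.Thread(target=self._work, args=(q,))
                        for q in self.queues]
        for t in self.threads:
            t.start()

    def tail(self, stream: Iterable[LogEntry]) -> None:
        """Consume log entries, keeping only those for the replicated bucket."""
        for entry in stream:
            if entry.bucket != self.bucket:
                continue
            # Same key -> same worker, so per-key order is preserved.
            shard = zlib.crc32(entry.key.encode()) % len(self.queues)
            self.queues[shard].put(entry)

    def _work(self, q: "queue.Queue") -> None:
        while True:
            entry = q.get()
            if entry is None:             # sentinel: stop this worker
                return
            with self.lock:
                if entry.op == "DELETE":
                    self.target.pop(entry.key, None)
                else:
                    self.target[entry.key] = entry.data

    def close(self) -> None:
        """Signal every worker to stop after draining its queue, then wait."""
        for q in self.queues:
            q.put(None)
        for t in self.threads:
            t.join()
```

Without some form of per-key ordering, two workers could apply a PUT and a later DELETE for the same key in the wrong order, leaving the target bucket inconsistent with the source.
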



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

