[ 
https://issues.apache.org/jira/browse/HDFS-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248896#comment-14248896
 ] 

Jing Zhao commented on HDFS-7535:
---------------------------------

A typical scenario for using snapshots with distcp looks like this: every time 
we start distcp between the primary cluster and the backup cluster, a snapshot 
is first created in the primary cluster. Then a snapshot diff report is 
computed between this latest snapshot and the snapshot created for the previous 
distcp run. This diff report represents the delta that should be applied to the 
backup cluster. For changes like deletion and rename, we can directly apply the 
same operations (in an order that respects their dependencies) on the backup 
cluster. For changes like creation, append, and other metadata modification, we 
keep using the functionality of the current distcp. With this approach, we 
avoid unnecessary data copying and also guarantee that the source data is 
immutable, since snapshots are read-only.
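
To make the split described above concrete, here is a minimal sketch (not the 
proposed implementation) that separates diff entries into operations to replay 
on the backup cluster and paths left to the regular distcp copy. It assumes the 
one-letter entry markers printed by the snapshot diff report ("+", "-", "M", 
"R"); the names DiffEntry, parse_diff_line and split_delta are hypothetical, 
and the dependency ordering of the replayed operations is deliberately left 
out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DiffEntry:
    """One line of a snapshot diff report (hypothetical in-memory form)."""
    kind: str                     # "+", "-", "M", or "R"
    path: str                     # path relative to the snapshottable dir
    target: Optional[str] = None  # rename destination, only set for "R"


def parse_diff_line(line: str) -> DiffEntry:
    """Parse a line such as 'R\t./a -> ./b' (assumed textual format)."""
    kind, rest = line.split(None, 1)
    if kind not in {"+", "-", "M", "R"}:
        raise ValueError(f"unknown diff entry type: {kind!r}")
    if kind == "R":
        src, dst = rest.split(" -> ", 1)
        return DiffEntry("R", src.strip(), dst.strip())
    return DiffEntry(kind, rest.strip())


def split_delta(entries: List[DiffEntry]) -> Tuple[List[DiffEntry], List[DiffEntry]]:
    """Split the delta into (replay on backup, copy with distcp).

    Deletions and renames can be replayed directly on the backup cluster;
    creations and modifications still need the normal copy path.
    """
    replay = [e for e in entries if e.kind in ("-", "R")]
    copy = [e for e in entries if e.kind in ("+", "M")]
    return replay, copy
```

Only the second list involves moving file data between clusters; a large 
directory rename ends up as a single entry in the first list.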

We plan to use this jira to provide the basic functionality for the above 
approach. More specifically, we can first add extra options to the current 
distcp tool so that it can compute the delta based on the diff report between 
two given snapshot names. Managing snapshots in the source/target clusters can 
be handled in separate jiras or through separate tools.
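
As a rough illustration of one sync round, the sketch below only builds the 
command lines and does not run them. The function name sync_round_commands and 
its parameters are hypothetical. The "-update -diff <from> <to>" form is how 
later Hadoop releases expose this feature and is an assumption here, not 
something stated in this comment, so check it against your Hadoop version.

```python
from typing import List


def sync_round_commands(src_dir: str, dst_dir: str,
                        prev_snapshot: str, new_snapshot: str) -> List[List[str]]:
    """Return the commands for one snapshot-based distcp round (not executed)."""
    return [
        # 1. Take a new snapshot of the source directory on the primary cluster.
        ["hdfs", "dfs", "-createSnapshot", src_dir, new_snapshot],
        # 2. Optional: inspect the delta between the two snapshots.
        ["hdfs", "snapshotDiff", src_dir, prev_snapshot, new_snapshot],
        # 3. Let distcp apply the delta to the backup cluster
        #    (assumed option names, see the note above).
        ["hadoop", "distcp", "-update", "-diff", prev_snapshot, new_snapshot,
         src_dir, dst_dir],
    ]
```

Creating and cleaning up the snapshots stays outside distcp here, which matches 
the plan to handle snapshot management separately.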

> Utilize Snapshot diff report for distcp
> ---------------------------------------
>
>                 Key: HDFS-7535
>                 URL: https://issues.apache.org/jira/browse/HDFS-7535
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently the HDFS snapshot diff report can identify file/directory creation, 
> deletion, rename and modification under a snapshottable directory. We can use 
> the diff report for distcp between the primary cluster and a backup cluster 
> to avoid unnecessary data copy. This is especially useful when a big 
> directory is renamed in the primary cluster: the current distcp cannot 
> detect the rename op, so the rename usually leads to a large amount of actual 
> data copying.
> More details of the approach will come in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
