[
https://issues.apache.org/jira/browse/HDFS-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248896#comment-14248896
]
Jing Zhao commented on HDFS-7535:
---------------------------------
A typical scenario for using snapshots with distcp looks like this: every time
we start distcp between the primary cluster and the backup cluster, a snapshot
is first created in the primary cluster. Then a snapshot diff report is computed
between this latest snapshot and the snapshot created for the previous distcp.
This snapshot diff report represents the delta that should be applied to the
backup cluster. For changes like deletion and rename, we can directly apply the
same operations (in an order determined by their dependencies) in the backup
cluster. For changes like creation, append, and other metadata modifications, we
keep using the functionality of the current distcp. With this approach, we avoid
unnecessary data copy and also guarantee that the source data is immutable,
since the snapshot is read-only.
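
To make the split above concrete, here is a minimal sketch in Python (not the
distcp code itself). It assumes the textual diff report format printed by
"hdfs snapshotDiff", where each line starts with "+", "-", "M" or "R" and a
rename is written as "old -> new". The names DiffEntry, parse_diff_line and
plan_sync are hypothetical and only used for illustration. The sketch keeps the
report order and does not attempt the dependency-based ordering mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DiffEntry:
    kind: str                     # "+", "-", "M", or "R"
    path: str                     # path relative to the snapshot root
    target: Optional[str] = None  # only set for renames


def parse_diff_line(line: str) -> DiffEntry:
    """Parse one report line such as 'R\t./a -> ./b' (assumed format)."""
    kind, rest = line.strip().split(None, 1)
    if kind == "R":
        source, target = rest.split(" -> ", 1)
        return DiffEntry(kind, source, target)
    return DiffEntry(kind, rest)


def plan_sync(
    entries: List[DiffEntry],
) -> Tuple[List[Tuple[str, str, Optional[str]]], List[str]]:
    """Split a diff into operations to replay on the backup cluster
    (delete, rename) and paths that still go through the normal copy."""
    direct_ops: List[Tuple[str, str, Optional[str]]] = []
    copy_paths: List[str] = []
    for entry in entries:
        if entry.kind == "-":
            direct_ops.append(("delete", entry.path, None))
        elif entry.kind == "R":
            direct_ops.append(("rename", entry.path, entry.target))
        elif entry.kind in ("+", "M"):
            copy_paths.append(entry.path)
        else:
            raise ValueError(f"unknown diff type: {entry.kind!r}")
    # NOTE: report order is preserved here; a real implementation would
    # have to reorder direct_ops according to their dependencies.
    return direct_ops, copy_paths


if __name__ == "__main__":
    report = ["M\t.", "+\t./new.txt", "-\t./old.txt", "R\t./big_dir -> ./renamed_dir"]
    ops, copies = plan_sync([parse_diff_line(line) for line in report])
    print(ops)
    print(copies)
```

In this sketch a rename of a large directory ends up as a single "rename"
operation for the backup cluster and never appears in the list of paths handed
to the regular copy.
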
We plan to use this jira to provide the basic functionality for the above
approach. More specifically, we can first add extra options to the current
distcp tool so that it can compute the delta based on the diff report between
two given snapshot names. Managing snapshots in the source/target clusters can
be handled in separate jiras or through separate tools.
> Utilize Snapshot diff report for distcp
> ---------------------------------------
>
> Key: HDFS-7535
> URL: https://issues.apache.org/jira/browse/HDFS-7535
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Jing Zhao
> Assignee: Jing Zhao
>
> Currently the HDFS snapshot diff report can identify file/directory creation,
> deletion, rename and modification under a snapshottable directory. We can use
> the diff report for distcp between the primary cluster and a backup cluster
> to avoid unnecessary data copy. This is especially useful when a big
> directory is renamed in the primary cluster: the current distcp cannot
> detect the rename op, so the rename usually leads to copying large amounts
> of data.
> More details of the approach will come in the first comment.