[ 
https://issues.apache.org/jira/browse/HDFS-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-7535:
----------------------------
    Attachment: HDFS-7535.001.patch

Update the patch with new strategies to handle rename operations and also add 
unit tests.

Currently for this feature we have the following assumptions:
# Both the source and target FileSystem must be DistributedFileSystem
# Two snapshots (e.g., s1 and s2) have been created on the source FS. The diff 
between these two snapshots will be copied to the target FS.
# The target has the same snapshot s1. No changes have been made on the target 
since s1. All the files/directories in the target are the same as in source.s1

We verify these assumptions before the sync and fall back to the default 
distcp behavior if they do not hold. Note that for #3 we currently only check 
that the diff between the current target state and target.s1 is empty, instead 
of directly comparing the target to source.s1. This may be fine since any 
failure while applying the snapshot diff on the target will cause distcp to 
copy all the data.
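
The checks above can be sketched roughly as follows. This is a minimal Python model and not code from the patch: ClusterState, can_use_snapshot_diff and choose_copy_mode are hypothetical names, and changes_since stands in for whatever the snapshot diff between target.s1 and the current target state returns.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Hypothetical stand-in for one side of the copy (not a Hadoop class)."""
    is_dfs: bool                                  # is it a DistributedFileSystem?
    snapshots: set = field(default_factory=set)   # names of existing snapshots
    # snapshot name -> diff entries between that snapshot and the current state
    changes_since: dict = field(default_factory=dict)


def can_use_snapshot_diff(source, target, s1, s2):
    # Assumption 1: both sides must be DistributedFileSystem.
    if not (source.is_dfs and target.is_dfs):
        return False
    # Assumption 2: both snapshots exist on the source.
    if not {s1, s2} <= source.snapshots:
        return False
    # Assumption 3: the target has s1 and nothing has changed since s1.
    # Only the target-side diff is checked, not target vs. source.s1.
    if s1 not in target.snapshots:
        return False
    return len(target.changes_since.get(s1, [])) == 0


def choose_copy_mode(source, target, s1, s2):
    # Fall back to the default behavior when any assumption fails.
    if can_use_snapshot_diff(source, target, s1, s2):
        return "snapshot-diff"
    return "default-distcp"
```

In this model, a target with any pending change since s1, a missing s1, or a non-DFS file system all lead to the default copy path.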

The main challenge here is to translate the rename diffs into rename ops that 
can actually be executed. For example, suppose the following rename ops happen 
in the source:
1) /test --> /foo-tmp
2) /foo --> /test
3) /bar --> /foo
4) /foo-tmp --> /bar

The snapshot diff report now looks like:
R /foo --> /test
R /test --> /bar
R /bar --> /foo
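
To see why four rename ops collapse into three report entries, here is a small Python sketch. It assumes a toy namespace that maps each path to an inode id and compares where each inode lives before and after; the helper names (rename, rename_diff) and the inode numbers are illustrative only and do not reflect how HDFS computes the report internally.

```python
def rename(ns, src, dst):
    """Rename in a toy namespace (path -> inode id); fail if dst exists."""
    if dst in ns:
        raise FileExistsError(dst)
    ns[dst] = ns.pop(src)


def rename_diff(before, after):
    """Map each path in `before` to its new path in `after`, if it moved."""
    where_now = {inode: path for path, inode in after.items()}
    return {
        path: where_now[inode]
        for path, inode in before.items()
        if inode in where_now and where_now[inode] != path
    }


ns = {"/test": 1, "/foo": 2, "/bar": 3}
s1 = dict(ns)                      # snapshot before the renames
for src, dst in [("/test", "/foo-tmp"), ("/foo", "/test"),
                 ("/bar", "/foo"), ("/foo-tmp", "/bar")]:
    rename(ns, src, dst)
s2 = dict(ns)                      # snapshot after the renames

report = rename_diff(s1, s2)       # /foo-tmp never appears: it is not in s1 or s2
```

The intermediate path /foo-tmp exists in neither snapshot, so the diff only records the net movement of each of the three entries.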

This diff report cannot be applied directly, since every target path in it is 
still occupied by the source of another rename. The current patch thus creates 
a tmp folder and breaks each rename op into two steps: move the source to the 
tmp folder, and then move the data from tmp to the target. Then we only need 
to sort all the first-phase renames by source path (so that files and subdirs 
are moved before their parents/ancestors), and sort all the second-phase 
renames by target path (so that parent directories are created first).
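
A minimal Python sketch of the two-phase idea, under stated assumptions: the namespace is a plain dict of path -> content, rename refuses to overwrite an existing path and requires the parent to exist, and the tmp directory name /.sync-tmp and the numbered slots are hypothetical. It illustrates the ordering argument only and is not the patch's implementation.

```python
import posixpath


def rename(tree, src, dst):
    """Move the subtree rooted at src to dst in a dict-based namespace."""
    if src not in tree:
        raise FileNotFoundError(src)
    if dst in tree:
        raise FileExistsError(dst)
    parent = posixpath.dirname(dst)
    if parent != "/" and parent not in tree:
        raise FileNotFoundError(parent)
    moved = {p: v for p, v in tree.items()
             if p == src or p.startswith(src + "/")}
    for p, v in moved.items():
        del tree[p]
        tree[dst + p[len(src):]] = v


def apply_rename_diff(tree, renames, tmp_dir="/.sync-tmp"):
    """Apply (source, target) rename pairs via a temporary directory."""
    tree[tmp_dir] = None
    staged = []
    # Phase 1: reverse-sorted source paths, so descendants leave first.
    by_source = sorted(renames, key=lambda r: r[0], reverse=True)
    for i, (src, dst) in enumerate(by_source):
        slot = f"{tmp_dir}/{i}"
        rename(tree, src, slot)
        staged.append((slot, dst))
    # Phase 2: sorted target paths, so parents exist before children.
    for slot, dst in sorted(staged, key=lambda s: s[1]):
        rename(tree, slot, dst)
    del tree[tmp_dir]
    return tree


target = {"/foo": "F", "/test": "T", "/bar": "B"}
diff = [("/foo", "/test"), ("/test", "/bar"), ("/bar", "/foo")]
result = apply_rename_diff(dict(target), diff)
```

Sorting plain path strings is enough here because an ancestor path is always a prefix of its descendants and therefore sorts before them.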

> Utilize Snapshot diff report for distcp
> ---------------------------------------
>
>                 Key: HDFS-7535
>                 URL: https://issues.apache.org/jira/browse/HDFS-7535
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HDFS-7535.000.patch, HDFS-7535.001.patch
>
>
> Currently HDFS snapshot diff report can identify file/directory creation, 
> deletion, rename and modification under a snapshottable directory. We can use 
> the diff report for distcp between the primary cluster and a backup cluster 
> to avoid unnecessary data copy. This is especially useful when there is a big 
> directory rename happening in the primary cluster: the current distcp cannot 
> detect the rename op thus this rename usually leads to large amounts of real 
> data copy.
> More details of the approach will come in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
