[
https://issues.apache.org/jira/browse/HDFS-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jing Zhao updated HDFS-7535:
----------------------------
Attachment: HDFS-7535.001.patch
Update the patch with new strategies to handle rename operations and also add
unit tests.
Currently for this feature we have the following assumptions:
# Both the source and target FileSystem must be DistributedFileSystem
# Two snapshots (e.g., s1 and s2) have been created on the source FS. The diff
between these two snapshots will be copied to the target FS.
# The target has the same snapshot s1. No changes have been made on the target
since s1. All the files/directories in the target are the same as in source.s1
We verify these assumptions before the sync and fall back to the default
distcp behavior if they do not hold. Note that for #3 we currently only check
that the diff between the current target and target.s1 is empty, instead of
directly comparing the target to source.s1. This may be fine, since any failure
while applying the snapshot diff on the target will cause distcp to copy all
the data.
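The verification and fallback described above could be sketched roughly as
follows. This is only an illustration in Python, not the code in the patch:
SyncInputs, can_sync_with_diff and choose_strategy are hypothetical names, and
the real checks would go through the HDFS client APIs rather than plain fields.

```python
from dataclasses import dataclass, field


@dataclass
class SyncInputs:
    """Hypothetical stand-in for what the sync step would query from HDFS."""
    source_is_dfs: bool
    target_is_dfs: bool
    source_snapshots: set = field(default_factory=set)
    target_snapshots: set = field(default_factory=set)
    # Entries of the diff between target.s1 and the current target state.
    target_diff_since_s1: list = field(default_factory=list)


def can_sync_with_diff(inputs, s1, s2):
    """Return True only if all three assumptions hold."""
    # 1: both file systems must be DistributedFileSystem.
    if not (inputs.source_is_dfs and inputs.target_is_dfs):
        return False
    # 2: both snapshots must exist on the source.
    if s1 not in inputs.source_snapshots or s2 not in inputs.source_snapshots:
        return False
    # 3: the target has s1 and nothing changed on the target since s1.
    if s1 not in inputs.target_snapshots:
        return False
    return len(inputs.target_diff_since_s1) == 0


def choose_strategy(inputs, s1, s2):
    """Pick the diff-based sync, or fall back to the default distcp copy."""
    if can_sync_with_diff(inputs, s1, s2):
        return "snapshot-diff-sync"
    return "default-distcp"
```

Note that, as in the comment above, check 3 only looks at whether the target
changed since target.s1; it does not compare the target against source.s1.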
The main challenge here is to translate the rename diffs into rename ops that
can actually be executed.
For example, if we have the following rename ops happening in the source:
1) /test --> /foo-tmp
2) /foo --> /test
3) /bar --> /foo
4) /foo-tmp --> /bar
The snapshot diff report now looks like:
R /foo --> /test
R /test --> /bar
R /bar --> /foo
This diff report cannot be applied directly: each rename target is still
occupied by an entry that has not been moved yet (e.g., /test still exists when
/foo --> /test is applied). The current patch thus creates a tmp folder and
breaks each rename op into two steps: move the source to the tmp folder, and
then move the data from tmp to the target. Then we only need to sort all the
first-phase renames based on the source paths (to make sure the files and
subdirs are moved before their parents/ancestors), and sort all the
second-phase renames based on the target paths (to make sure the parent
directories are created first).
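A minimal sketch of the two-phase idea, written in Python against an in-memory
dict rather than HDFS. It is not the code in the patch: rename,
apply_two_phase and the /.tmp-sync folder name are made up for illustration,
and the sort directions (descending source paths, ascending target paths) are
one assumed way to get the ordering described above.

```python
def rename(tree, src, dst):
    """Move src and everything below it to dst in a dict-based namespace."""
    if src not in tree:
        raise FileNotFoundError(src)
    if dst in tree:
        raise FileExistsError(dst)
    # Collect the affected paths first so the dict is not mutated mid-scan.
    moved = sorted(p for p in tree if p == src or p.startswith(src + "/"))
    for path in moved:
        tree[dst + path[len(src):]] = tree.pop(path)


def apply_two_phase(tree, renames, tmp="/.tmp-sync"):
    """Apply (source, target) pairs from a diff report via a tmp folder."""
    staged = []
    # Phase 1: descending source order puts a descendant before its
    # ancestors, because an ancestor path is a prefix of the descendant.
    ordered = sorted(renames, key=lambda r: r[0], reverse=True)
    for i, (src, dst) in enumerate(ordered):
        slot = f"{tmp}/{i}"  # one private slot per rename
        rename(tree, src, slot)
        staged.append((slot, dst))
    # Phase 2: ascending target order restores a parent directory before
    # anything that has to land underneath it.
    for slot, dst in sorted(staged, key=lambda r: r[1]):
        rename(tree, slot, dst)


# The example from this comment: the state at s1 and the diff report.
tree = {"/test": "T", "/foo": "F", "/bar": "B"}
report = [("/foo", "/test"), ("/test", "/bar"), ("/bar", "/foo")]
apply_two_phase(tree, report)
```

After the call, the original /foo data sits at /test, the original /test data
at /bar, and the original /bar data at /foo. Calling rename directly with the
first report entry on the s1 state raises FileExistsError in this model,
because /test is still present.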
> Utilize Snapshot diff report for distcp
> ---------------------------------------
>
> Key: HDFS-7535
> URL: https://issues.apache.org/jira/browse/HDFS-7535
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp, snapshots
> Reporter: Jing Zhao
> Assignee: Jing Zhao
> Attachments: HDFS-7535.000.patch, HDFS-7535.001.patch
>
>
> Currently the HDFS snapshot diff report can identify file/directory creation,
> deletion, rename and modification under a snapshottable directory. We can use
> the diff report for distcp between the primary cluster and a backup cluster
> to avoid unnecessary data copy. This is especially useful when there is a big
> directory rename happening in the primary cluster: the current distcp cannot
> detect the rename op, so the rename usually leads to large amounts of real
> data copy.
> More details of the approach will come in the first comment.