[ https://issues.apache.org/jira/browse/HDFS-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jing Zhao updated HDFS-7535:
----------------------------
    Attachment: HDFS-7535.001.patch

Update the patch with new strategies to handle rename operations, and add unit tests.

Currently for this feature we have the following assumptions:
# Both the source and target FileSystem must be DistributedFileSystem.
# Two snapshots (e.g., s1 and s2) have been created on the source FS. The diff between these two snapshots will be copied to the target FS.
# The target has the same snapshot s1. No changes have been made on the target since s1. All the files/directories in the target are the same as in source.s1.

We verify these assumptions before the sync and fall back to the default distcp behavior if they do not hold. Note that for #3 we currently only check that the diff between the current target and target.s1 is empty, instead of directly comparing the target to source.s1. This may be fine, since any failure while applying the snapshot diff on the target will cause distcp to copy all the data.

The main challenge here is to translate the rename diffs into rename ops that can actually be applied. For example, suppose the following rename ops happen in the source:
1) /test --> /foo-tmp
2) /foo --> /test
3) /bar --> /foo
4) /foo-tmp --> /bar

The snapshot diff report then looks like:
R /foo --> /test
R /test --> /bar
R /bar --> /foo

This diff report cannot be applied directly, because the target of each rename still exists until another rename in the report has moved it away. The current patch thus creates a tmp folder and breaks each rename op into two steps: move the source to the tmp folder, and then move the data from tmp to the target. Then we only need to sort all the first-phase renames based on the source paths (to make sure the files and subdirs are moved before their parents/ancestors), and sort all the second-phase renames based on the target paths (to make sure the parent directories are created first).
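
The two-phase rename strategy can be illustrated with a small sketch. This is not the code from the patch: it models the namespace as a plain Python dict instead of using the HDFS API, and the names `rename`, `apply_rename_diff` and the tmp folder `/.tmp-sync` are hypothetical, chosen only for this illustration.

```python
from typing import Dict, List, Tuple


def rename(ns: Dict[str, str], src: str, dst: str) -> None:
    """Move src (and everything under it) to dst in a toy namespace.

    ns maps absolute paths to opaque content labels; it stands in for
    the target file system and is NOT an HDFS API.
    """
    if src not in ns:
        raise FileNotFoundError(src)
    if dst in ns:
        raise FileExistsError(dst)
    # Collect the affected paths first so the dict is not mutated
    # while it is being iterated.
    moved = [p for p in ns if p == src or p.startswith(src + "/")]
    for path in moved:
        ns[dst + path[len(src):]] = ns.pop(path)


def apply_rename_diff(
    ns: Dict[str, str],
    renames: List[Tuple[str, str]],
    tmp_dir: str = "/.tmp-sync",  # hypothetical tmp folder name
) -> None:
    """Apply (source, target) rename pairs from a diff report in two phases."""
    staged = []
    # Phase 1: move every source into the tmp folder. Descending source
    # order puts descendants before their ancestors, because an ancestor
    # path is always a prefix of (and so sorts before) its descendants.
    by_source = sorted(renames, key=lambda r: r[0], reverse=True)
    for i, (src, dst) in enumerate(by_source):
        tmp = f"{tmp_dir}/{i}"
        rename(ns, src, tmp)
        staged.append((tmp, dst))
    # Phase 2: move from tmp to the final targets. Ascending target
    # order places parent directories before their children.
    for tmp, dst in sorted(staged, key=lambda r: r[1]):
        rename(ns, tmp, dst)


# The example from this comment: the target still looks like s1.
target = {"/test": "A", "/foo": "B", "/bar": "C"}
diff_report = [("/foo", "/test"), ("/test", "/bar"), ("/bar", "/foo")]
apply_rename_diff(target, diff_report)
```

In this toy model, applying the first report entry directly (`/foo` to `/test`) fails because `/test` still exists, whereas staging everything through the tmp folder leaves the target with the same layout as the source after its four renames.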

> Utilize Snapshot diff report for distcp
> ---------------------------------------
>
>                 Key: HDFS-7535
>                 URL: https://issues.apache.org/jira/browse/HDFS-7535
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HDFS-7535.000.patch, HDFS-7535.001.patch
>
>
> Currently HDFS snapshot diff report can identify file/directory creation,
> deletion, rename and modification under a snapshottable directory. We can use
> the diff report for distcp between the primary cluster and a backup cluster
> to avoid unnecessary data copy. This is especially useful when there is a big
> directory rename happening in the primary cluster: the current distcp cannot
> detect the rename op thus this rename usually leads to large amounts of real
> data copy.
> More details of the approach will come in the first comment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)