[ https://issues.apache.org/jira/browse/HDFS-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248778#comment-15248778 ]
Yongjun Zhang edited comment on HDFS-10314 at 10/3/16 8:43 PM:
---------------------------------------------------------------

The idea is to wrap around distcp as a tool to achieve the functionality of distcp's switch -rdiff (if we do the same for -diff, it will be a different jira).

Here is a description and comparison of the -diff and unimplemented -rdiff switches.

{code}
Definition:

Assuming we have two snapshots, s1 and s2, where s1 is created earlier, and s2 is newer.
- SnapshotDiff(s1, s2) represents the delta between s1 and s2. That is, if we apply SnapshotDiff(s1, s2) on top of s1, we can go to the state of s2.
- SnapshotDiff(s2, s1) represents the reversed delta between s1 and s2. That is, if we apply SnapshotDiff(s2, s1) on top of s2, we can go back to the state of s1.

Note: When we talk about source and target, we mean distcp source and distcp target.

A. -diff allows distcp to efficiently copy incremental changes made (on top of previously copied snapshot s1) in source cluster to target cluster.

   Assuming snapshot s2 is created at the source to capture s1 + incremental changes, SnapshotDiff(s1, s2) is the incremental changes. The output of this operation is that the target will be at s2 state.

   This operation involves three steps:
   A.1 calculate SnapshotDiff(s1, s2) at the source
   A.2 apply the rename and delete portion of the snapshot diff at the target. This step is called "sync"
   A.3 copy created/modified files from source's s2 to target

B. -rdiff allows distcp to efficiently copy data from snapshot s1 to overwrite changes made in target after snapshot s1 was created in target.

   Assuming snapshot s2 is created at the target to capture the changes that need to be overwritten, SnapshotDiff(s2, s1) is what we want to apply to target. The output of this operation is that the target is at s1 state.
   Similar to -diff, but with some differences, this operation also involves three steps:
   B.1 calculate SnapshotDiff(s2, s1) at the target
   B.2 apply the rename and delete portion of the snapshot diff at the target. This step is called "sync"
   B.3 copy created/modified files from source's s1 to target
   (The source here can be a different cluster, or the target itself. When it's a different cluster, that cluster has to have a snapshot s1 with exactly the same name and content as the s1 at the target.)

A tabularized comparison:

          required snapshots      DiffCalc   Output After Operation
          --------------------------
          source       target
------------------------------------------------------------------
-diff     s1, s2   ->  s1         source     target is at s2
-rdiff    s1       ->  s1, s2     target     target is at s1
(note: for -rdiff, the source could be the same as target)

So the "r" (reversed) in -rdiff means the following, and is very symmetric to -diff:
- swap the snapshot requirement of source and target in -diff (from "s1, s2 -> s1" to "s1 -> s1, s2")
- swap the result snapshot after operation (from s2 to s1)
- swap the snapshot diff calculation place (from source to target)

We require source and target to have the same snapshot s1 (same snapshot name, same content).
{code}

> A new tool to sync current HDFS view to specified snapshot
> ----------------------------------------------------------
>
>                 Key: HDFS-10314
>                 URL: https://issues.apache.org/jira/browse/HDFS-10314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-10314.001.patch
>
>
> HDFS-9820 proposed adding -rdiff switch to distcp, as a reversed operation of
> -diff switch.
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps
> around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to unix/linux
> command "rsync". The "r" here means remote.
> The syntax that simulates the -rdiff behavior proposed in HDFS-9820 is
> {code}
> rsync <fromSnapshotName> <toSnapshotName> <source> <target>
> {code}
> This command ensures <fromSnapshotName> is newer than <toSnapshotName>.
> I think, in the future, we can add another command to have the functionality
> of -diff switch of distcp.
> {code}
> sync <fromSnapshotName> <toSnapshotName> <source> <target>
> {code}
> that ensures <fromSnapshotName> is older than <toSnapshotName>.
> Thanks [~jingzhao].
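
As a rough illustration of the three-step -diff and -rdiff flows described in the comment above, here is a toy Python model. It is only a sketch under stated assumptions, not how distcp or HDFS implement snapshots: a snapshot is modeled as a plain dict from path to file content, renames are not tracked (a rename appears as a delete plus a create), and all function names (`snapshot_diff`, `sync_and_copy`, `forward_diff`, `reverse_diff`) are hypothetical.

```python
# Toy model (assumption): a snapshot is a dict mapping path -> file content.
# Renames are not modeled; a rename shows up as a delete plus a create.

def snapshot_diff(from_snap, to_snap):
    """Return the delta that turns from_snap into to_snap."""
    deleted = sorted(p for p in from_snap if p not in to_snap)
    created = sorted(p for p in to_snap if p not in from_snap)
    modified = sorted(
        p for p in from_snap if p in to_snap and from_snap[p] != to_snap[p]
    )
    return {"deleted": deleted, "created": created, "modified": modified}


def sync_and_copy(target_state, diff, copy_from):
    """Apply the delete portion at the target ("sync"), then copy
    created/modified files from the snapshot given as copy_from."""
    result = dict(target_state)
    for path in diff["deleted"]:
        del result[path]
    for path in diff["created"] + diff["modified"]:
        result[path] = copy_from[path]
    return result


def forward_diff(source_s1, source_s2, target_at_s1):
    """Model of -diff: the diff is calculated at the source (A.1),
    then the target is synced (A.2) and files come from source's s2 (A.3)."""
    diff = snapshot_diff(source_s1, source_s2)
    return sync_and_copy(target_at_s1, diff, source_s2)


def reverse_diff(source_s1, target_s1, target_s2):
    """Model of -rdiff: the diff is calculated at the target (B.1),
    then the target is synced (B.2) and files come from source's s1 (B.3)."""
    diff = snapshot_diff(target_s2, target_s1)
    return sync_and_copy(target_s2, diff, source_s1)


# Example data (made up for illustration).
s1 = {"/a": "1", "/b": "2", "/c": "3"}
s2 = {"/a": "1", "/b": "22", "/d": "4"}

print(forward_diff(s1, s2, dict(s1)) == s2)  # target moves from s1 to s2
print(reverse_diff(s1, s1, s2) == s1)        # target moves from s2 back to s1
```

In this model the two operations differ only in where the diff is calculated, in which direction it runs, and which snapshot the files are copied from, which mirrors the symmetry shown in the comparison table.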