[
https://issues.apache.org/jira/browse/HDFS-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248778#comment-15248778
]
Yongjun Zhang edited comment on HDFS-10314 at 4/19/16 10:31 PM:
----------------------------------------------------------------
The idea is to wrap distcp in a tool that achieves the functionality of
distcp's -rdiff switch (if we do the same for -diff, it will be a
different jira). Here is a description and comparison of the -diff and
unimplemented -rdiff switches.
{code}
Definition: Assuming we have two snapshots, s1 and s2, where s1 is created
earlier, and s2 is newer.
- snapshotDiff(s1, s2) represents the delta between s1 and s2. That is, if we
  apply snapshotDiff(s1, s2) on top of s1, we can go to the state of s2.
- snapshotDiff(s2, s1) represents the reversed delta between s1 and s2. That
  is, if we apply snapshotDiff(s2, s1) on top of s2, we can go back to the
  state of s1.
Note: When we talk about source and target, we mean distcp source and distcp
target.
A. -diff allows distcp to efficiently copy incremental changes made (on top of
   previously copied snapshot s1) in the source cluster to the target cluster.
   Assuming snapshot s2 is created at the source to capture s1 + incremental
   changes, snapshotDiff(s1, s2) is the incremental changes. The output of
   this operation is that the target will be at s2 state. This operation
   involves three steps:
   A.1 calculate snapshotDiff(s1, s2) at the source
   A.2 apply the rename and delete portion of the snapshotDiff at the target;
       this step is called "sync"
   A.3 copy created/modified files from source's s2 to target
B. -rdiff allows distcp to efficiently copy data from snapshot s1 to overwrite
   changes made in the target after snapshot s1 was created in the target.
   Assuming snapshot s2 is created at the target to capture the changes that
   need to be overwritten, snapshotDiff(s2, s1) is what we want to apply to
   the target. The output of this operation is that the target is at s1
   state. Similar to -diff, but with differences, this operation involves
   three steps too:
   B.1 calculate snapshotDiff(s2, s1) at the target
   B.2 apply the rename and delete portion of the snapshotDiff at the target;
       this step is called "sync"
   B.3 copy created/modified files from source's s1 to target (the source
       here can be a different cluster, or the target itself. When it's a
       different cluster, that cluster has to have a snapshot s1 that has
       exactly the same name and content as the s1 at the target)
A tabularized comparison:
          required snapshots        DiffCalc   Output After Operation
          ----------------------
          source        target
---------------------------------------------------------------------
-diff     s1, s2   ->   s1          source     target is at s2
-rdiff    s1       ->   s1, s2      target     target is at s1
(note, for -rdiff, the source could be the same as target)
So the "r" (reversed) in -rdiff means the following, and is symmetric
to -diff:
- swap the snapshot requirement of source and target in -diff
  (from "s1, s2 -> s1" to "s1 -> s1, s2")
- swap the result snapshot after operation (from s2 to s1)
- swap the snapshot diff calculation place (from source to target)
We require source and target to have the same snapshot s1 (same snapshot name,
same content).
{code}
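To make the definitions and the three-step flows above concrete, here is a minimal Python sketch. It is only a toy model, not distcp code: the file-id based Snapshot type, the Diff structure, and the function names (snapshot_diff, sync, diff_copy, rdiff_copy) are all hypothetical, and the target is modeled as a plain dict from path to content.

```python
from typing import Dict, List, NamedTuple, Tuple


class File(NamedTuple):
    path: str
    data: str


# Hypothetical model: a snapshot maps a stable file id to its path and content.
Snapshot = Dict[int, File]


class Diff(NamedTuple):
    renamed: List[Tuple[str, str]]  # (path in "from", path in "to")
    deleted: List[str]              # paths that exist only in "from"
    changed: List[str]              # paths created or modified in "to"


def snapshot_diff(frm: Snapshot, to: Snapshot) -> Diff:
    """Delta that, applied on top of `frm`, yields the state of `to`."""
    renamed = [(frm[i].path, to[i].path)
               for i in frm if i in to and frm[i].path != to[i].path]
    deleted = [frm[i].path for i in frm if i not in to]
    changed = [to[i].path
               for i in to if i not in frm or frm[i].data != to[i].data]
    return Diff(renamed, deleted, changed)


def live_state(snap: Snapshot) -> Dict[str, str]:
    """Path -> content view of a snapshot (stands in for a live directory)."""
    return {f.path: f.data for f in snap.values()}


def sync(target: Dict[str, str], diff: Diff) -> None:
    """The "sync" step: apply only the delete and rename portion of the diff."""
    for path in diff.deleted:
        del target[path]
    # Detach every renamed file first so that swapped names do not collide.
    moved = [(new, target.pop(old)) for old, new in diff.renamed]
    target.update(moved)


def copy_changed(target: Dict[str, str], src: Snapshot, diff: Diff) -> None:
    """Copy created/modified files from the given source snapshot."""
    by_path = live_state(src)
    for path in diff.changed:
        target[path] = by_path[path]


def diff_copy(src_s1: Snapshot, src_s2: Snapshot,
              target: Dict[str, str]) -> None:
    """-diff flow: the target starts at s1 and ends at s2."""
    diff = snapshot_diff(src_s1, src_s2)   # A.1, computed at the source
    sync(target, diff)                     # A.2
    copy_changed(target, src_s2, diff)     # A.3, copy from source's s2


def rdiff_copy(tgt_s1: Snapshot, tgt_s2: Snapshot, src_s1: Snapshot,
               target: Dict[str, str]) -> None:
    """-rdiff flow: the target starts at s2 and ends back at s1."""
    diff = snapshot_diff(tgt_s2, tgt_s1)   # B.1, computed at the target
    sync(target, diff)                     # B.2
    copy_changed(target, src_s1, diff)     # B.3, copy from source's s1
```

In this model the only difference between the two flows is which side computes the diff, the order of the two snapshots, and which snapshot the final copy reads from, which matches the symmetry described above.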
> Propose a new tool that wraps around distcp to "restore" changes on target
> cluster
> ----------------------------------------------------------------------------------
>
> Key: HDFS-10314
> URL: https://issues.apache.org/jira/browse/HDFS-10314
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: tools
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
>
> HDFS-9820 proposed adding a -rdiff switch to distcp, as the reversed operation
> of the -diff switch.
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps
> around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to unix/linux
> command "rsync". The "r" here means remote.
> The syntax that simulates the -rdiff behavior proposed in HDFS-9820 is
> {code}
> rsync <fromSnapshotName> <toSnapshotName> <source> <target>
> {code}
> This command ensures <fromSnapshotName> is newer than <toSnapshotName>.
> I think that, in the future, we can add another command to provide the
> functionality of the -diff switch of distcp.
> {code}
> sync <fromSnapshotName> <toSnapshotName> <source> <target>
> {code}
> that ensures <fromSnapshotName> is older than <toSnapshotName>.
> Thanks [~jingzhao].
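The ordering rule for the two proposed commands could be checked roughly as in the Python sketch below. This is only an illustration of the constraint stated in the description: the function name validate_snapshot_order and the idea of passing the snapshot names as a list in creation order are assumptions, not part of any existing tool.

```python
from typing import List


def validate_snapshot_order(creation_order: List[str], from_name: str,
                            to_name: str, command: str) -> None:
    """Check the from/to snapshot ordering for the proposed commands.

    creation_order lists snapshot names oldest first (hypothetical input).
    "rsync" requires from to be newer than to; "sync" requires it to be older.
    """
    i_from = creation_order.index(from_name)  # ValueError if name is unknown
    i_to = creation_order.index(to_name)
    if command == "rsync" and not i_from > i_to:
        raise ValueError("rsync: fromSnapshot must be newer than toSnapshot")
    if command == "sync" and not i_from < i_to:
        raise ValueError("sync: fromSnapshot must be older than toSnapshot")
```

For example, with snapshots created in the order s1, s2, the call validate_snapshot_order(["s1", "s2"], "s2", "s1", "rsync") passes, while swapping the two names raises ValueError.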
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)