[ 
https://issues.apache.org/jira/browse/HDFS-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248778#comment-15248778
 ] 

Yongjun Zhang edited comment on HDFS-10314 at 10/3/16 8:43 PM:
---------------------------------------------------------------

The idea is to wrap distcp in a tool that provides the functionality of 
distcp's proposed -rdiff switch (doing the same for -diff would be a 
separate jira). Here is a description and comparison of the -diff switch 
and the unimplemented -rdiff switch. 

{code}
Definition: Assume we have two snapshots, s1 and s2, where s1 was created
earlier and s2 is newer.

- snapshotDiff(s1, s2) represents the delta between s1 and s2. That is, if we
  apply snapshotDiff(s1, s2) on top of s1, we reach the state of s2.
- snapshotDiff(s2, s1) represents the reversed delta between s1 and s2. That
  is, if we apply snapshotDiff(s2, s1) on top of s2, we go back to the state
  of s1.

Note: "source" and "target" always mean the distcp source and the distcp
target.

A. -diff allows distcp to efficiently copy incremental changes made (on top of
   the previously copied snapshot s1) in the source cluster to the target
   cluster. Assuming snapshot s2 is created at the source to capture s1 plus
   the incremental changes, snapshotDiff(s1, s2) is the incremental changes.
   After this operation, the target is at the s2 state. The operation
   involves three steps:

  A.1 Calculate snapshotDiff(s1, s2) at the source.
  A.2 Apply the rename and delete portion of the snapshot diff at the target.
      This step is called "sync".
  A.3 Copy created/modified files from the source's s2 to the target.

B. -rdiff allows distcp to efficiently copy data from snapshot s1 to overwrite
   changes made in the target after snapshot s1 was created in the target.
   Assuming snapshot s2 is created at the target to capture the changes that
   need to be overwritten, snapshotDiff(s2, s1) is what we want to apply to
   the target. After this operation, the target is at the s1 state. Like
   -diff, though with some differences, the operation involves three steps:

  B.1 Calculate snapshotDiff(s2, s1) at the target.
  B.2 Apply the rename and delete portion of the snapshot diff at the target.
      This step is called "sync".
  B.3 Copy created/modified files from the source's s1 to the target. (The
      source here can be a different cluster or the target itself. When it is
      a different cluster, that cluster must have a snapshot s1 with exactly
      the same name and content as the s1 at the target.)

A tabularized comparison:

          required snapshots
          source      target      DiffCalc    Output After Operation
          ----------------------------------------------------------
-diff     s1, s2  ->  s1          source      target is at s2
-rdiff    s1      ->  s1, s2      target      target is at s1

(Note: for -rdiff, the source can be the same as the target.)

So the "r" (reversed) in -rdiff means the following, which is symmetric to
-diff:

- swap the snapshot requirements of source and target in -diff
  (from "s1, s2 -> s1" to "s1 -> s1, s2")
- swap the resulting snapshot state after the operation (from s2 to s1)
- swap the place where the snapshot diff is calculated (from source to target)

We require the source and target to have the same snapshot s1 (same snapshot
name, same content).
{code}
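
To make the two operations concrete, here is a minimal Python sketch of the "sync" and copy steps on a toy model. It is not the distcp implementation and does not use any HDFS API. It assumes a snapshot can be modeled as a dict from a hypothetical inode id to a (path, content) pair, and the names snapshot_diff and apply_diff are made up for illustration.

```python
# Toy model only: a snapshot is {inode_id: (path, content)}.
# Inode ids, snapshot_diff and apply_diff are illustrative assumptions,
# not part of distcp or HDFS.

def snapshot_diff(from_snap, to_snap):
    """Return the delta that turns from_snap into to_snap."""
    # Files present in from_snap but gone in to_snap.
    deleted = [i for i in from_snap if i not in to_snap]
    # Files present in both whose path changed: inode id -> new path.
    renamed = {i: to_snap[i][0] for i in from_snap
               if i in to_snap and from_snap[i][0] != to_snap[i][0]}
    # Files that are new in to_snap or whose content changed.
    copied = [i for i in to_snap
              if i not in from_snap or from_snap[i][1] != to_snap[i][1]]
    return deleted, renamed, copied


def apply_diff(target, diff, copy_from):
    """Apply diff to a target that is currently at the diff's 'from' state."""
    deleted, renamed, copied = diff
    result = dict(target)
    # "sync" step: apply deletes and renames at the target.
    for i in deleted:
        del result[i]
    for i, new_path in renamed.items():
        result[i] = (new_path, result[i][1])
    # Copy step: created/modified files come from the source snapshot.
    for i in copied:
        result[i] = copy_from[i]
    return result


s1 = {1: ("/d/a", "v1"), 2: ("/d/b", "v1"), 3: ("/d/c", "v1")}
s2 = {1: ("/d/a", "v2"), 2: ("/d/b2", "v1"), 4: ("/d/new", "v1")}

# -diff: target is at s1, copy from the source's s2, target ends at s2.
after_diff = apply_diff(s1, snapshot_diff(s1, s2), copy_from=s2)
# -rdiff: target is at s2, copy from the source's s1, target ends at s1.
after_rdiff = apply_diff(s2, snapshot_diff(s2, s1), copy_from=s1)
```

In this sketch the only difference between the two modes is the argument order of the diff and which snapshot the files are copied from, which is the symmetry described above.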





> A new tool to sync current HDFS view to specified snapshot
> ----------------------------------------------------------
>
>                 Key: HDFS-10314
>                 URL: https://issues.apache.org/jira/browse/HDFS-10314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-10314.001.patch
>
>
> HDFS-9820 proposed adding an -rdiff switch to distcp, as the reverse 
> operation of the -diff switch. 
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps 
> around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to the unix/linux 
> command "rsync". The "r" here means remote.
> The syntax that simulates the -rdiff behavior proposed in HDFS-9820 is
> {code}
> rsync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> This command ensures that <fromSnapshotName> is newer than <toSnapshotName>.
> In the future, I think we can add another command that provides the 
> functionality of distcp's -diff switch.
> {code}
> sync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> It ensures that <fromSnapshotName> is older than <toSnapshotName>.
> Thanks [~jingzhao].
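
The ordering checks described for the two proposed commands could look roughly like the Python sketch below. This is only an illustration under stated assumptions: validate_snapshot_order and the created_at mapping (snapshot name to a creation sequence number, larger meaning newer) are hypothetical and not part of the attached patch or of distcp.

```python
# Hypothetical sketch of the snapshot-order check for the proposed commands.
# created_at maps snapshot name -> creation sequence number (larger = newer);
# how a real tool would obtain this ordering is not covered here.

def validate_snapshot_order(command, from_snapshot, to_snapshot, created_at):
    """Raise ValueError if the snapshot order does not match the command."""
    from_time = created_at[from_snapshot]
    to_time = created_at[to_snapshot]
    if command == "rsync":
        # rsync simulates -rdiff: fromSnapshotName must be newer.
        if not from_time > to_time:
            raise ValueError(
                f"rsync requires {from_snapshot} to be newer than {to_snapshot}")
    elif command == "sync":
        # sync simulates -diff: fromSnapshotName must be older.
        if not from_time < to_time:
            raise ValueError(
                f"sync requires {from_snapshot} to be older than {to_snapshot}")
    else:
        raise ValueError(f"unknown command: {command}")


created_at = {"s1": 1, "s2": 2}
validate_snapshot_order("rsync", "s2", "s1", created_at)  # accepted
validate_snapshot_order("sync", "s1", "s2", created_at)   # accepted
```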


