[ 
https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543394#comment-15543394
 ] 

Yongjun Zhang commented on HDFS-9820:
-------------------------------------

Copied from 
https://issues.apache.org/jira/browse/HDFS-10314?focusedCommentId=15510391&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15510391

For clarity, and as a recap, here is a comparison table between -diff and the 
proposed -rdiff, which shows the symmetry between the two:

||Comparison||-diff s1 s2 <src> <tgt>||-rdiff s2 s1 <src> <tgt>||
|Current feature state|Existing in distcp|Proposed Addition |
|Functionality| Given <tgt>'s current state is s1, make <tgt>'s current state the same as newer snapshot s2 | Given <tgt>'s current state is s2, make <tgt>'s current state the same as older snapshot s1 |
|Requirements| # <src> and <tgt> need to be different paths
# both <src> and <tgt> have snapshot s1 with exact same content 
# <src> has snapshot s2
# s2 is newer than s1
# <tgt>'s current state is the same as s1
# <tgt> doesn't have snapshot s2 | # <src> and <tgt> can be the same or different paths
# both <src> and <tgt> have snapshot s1 with exact same content
# <tgt> has snapshot s2
#  s2 is newer than s1 
# <tgt>'s current state is the same as s2
# <src> may or may not have snapshot s2 |
|Steps|# calculate snapshotDiff<s1,s2> at <src> 
# apply rename/delete part of snapshotDiff on <tgt> 
# copy modified part of snapshotDiff from s2 of <src> to <tgt> | # calculate snapshotDiff<s2,s1> at <tgt> 
# apply rename/delete part of snapshotDiff on <tgt> 
# copy modified part of snapshotDiff from s1 of <src> to <tgt> |
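
The -rdiff steps in the last row can be sketched roughly as follows. This is 
only an illustration of the idea, not the distcp implementation: the names 
DiffEntry, SyncPlan and plan_sync are hypothetical, and the sketch assumes the 
diff entries carry the usual snapshot diff markers (+ created, - deleted, 
M modified, R renamed). It also assumes that entries marked + in 
snapshotDiff<s2,s1> (present in s1 but missing from s2) go on the copy list 
together with the modified ones, and it ignores ordering conflicts between 
renames and deletes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DiffEntry:
    """One line of a snapshot diff report (hypothetical, simplified model)."""
    kind: str                       # '+', '-', 'M' or 'R'
    path: str                       # path in the "from" snapshot
    new_path: Optional[str] = None  # only set for renames ('R')


@dataclass
class SyncPlan:
    """Operations to run on <tgt>, grouped by step."""
    renames: List[Tuple[str, str]] = field(default_factory=list)
    deletes: List[str] = field(default_factory=list)
    copies: List[str] = field(default_factory=list)


def plan_sync(diff: List[DiffEntry]) -> SyncPlan:
    """Split snapshotDiff<s2,s1> into rename/delete ops and a copy list."""
    plan = SyncPlan()
    for entry in diff:
        if entry.kind == "R":
            if entry.new_path is None:
                raise ValueError(f"rename without destination: {entry.path}")
            # Renamed since s1: rename it back on <tgt>.
            plan.renames.append((entry.path, entry.new_path))
        elif entry.kind == "-":
            # Exists in s2 but not in s1 (created since s1): delete it on <tgt>.
            plan.deletes.append(entry.path)
        elif entry.kind in ("M", "+"):
            # Modified, or missing from s2 (assumption): copy from s1 of <src>.
            plan.copies.append(entry.path)
        else:
            raise ValueError(f"unknown diff entry type: {entry.kind!r}")
    return plan


# Example with made-up paths.
example_plan = plan_sync([
    DiffEntry("R", "data/b.txt", "data/a.txt"),
    DiffEntry("-", "tmp/scratch"),
    DiffEntry("M", "data/c.txt"),
])
```

The copy list is the only part that needs distcp to move data, which is why 
the cost is proportional to the number of changed files rather than to the 
size of the directory.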


> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch, 
> HDFS-9820.003.patch, HDFS-9820.004.patch, HDFS-9820.005.patch
>
>
> A common use scenario (scenario 1): 
> # create snapshot sx in clusterX, 
> # do some experiments in clusterX, which create some files. 
> # throw away the changed files and go back to sx.
> Another scenario (scenario 2): there is a production cluster and a backup 
> cluster, and we periodically sync the data from the production cluster to 
> the backup cluster with distcp. 
> The cluster in scenario 1 could be the backup cluster in scenario 2.
> For scenario 1:
> HDFS-4167 intends to restore HDFS to the most recent snapshot, which involves 
> some complexity and challenges.  Until that jira is implemented, we rely on 
> distcp to copy from the snapshot to the current state. However, this 
> operation can perform very poorly because it has to go through all files 
> even if only a few files changed.
> For scenario 2:
> HDFS-7535 improved distcp performance by avoiding copying files that were 
> only renamed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data 
> from the source to the target cluster, by copying only the files changed 
> since the last backup. It works by using snapshot diff to find all changed 
> files and copying only those.
> See 
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the 
> target cluster since the last snapshot sx, and copy them from snapshot sx of 
> either the source or the target cluster, to restore the target cluster's 
> current state to sx. 
> Specifically,
> If a file/dir is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, put it to the copy list
> Then run distcp with the copy list, copying from the source cluster's 
> corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> As a native restore feature, HDFS-4167 would still be ideal to have. However, 
> HDFS-9820 would hopefully be easier to implement, and can be used until 
> HDFS-4167 is in place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
