[ https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543394#comment-15543394 ]
Yongjun Zhang commented on HDFS-9820:
-------------------------------------

Copied from https://issues.apache.org/jira/browse/HDFS-10314?focusedCommentId=15510391&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15510391

For clarity, and as a recap, here is a comparison table between -diff and the proposed -rdiff, which shows the symmetry:

||Comparison||-diff s1 s2 <src> <tgt>||-rdiff s2 s1 <src> <tgt>||
|Current feature state|Existing in distcp|Proposed addition|
|Functionality|Given <tgt>'s current state is s1, make <tgt>'s current state the same as newer snapshot s2|Given <tgt>'s current state is s2, make <tgt>'s current state the same as older snapshot s1|
|Requirements|# <src> and <tgt> need to be different paths
# both <src> and <tgt> have snapshot s1 with exact same content
# <src> has snapshot s2
# s2 is newer than s1
# <tgt>'s current state is the same as s1
# <tgt> doesn't have snapshot s2
|# <src> and <tgt> can be the same or different paths
# both <src> and <tgt> have snapshot s1 with exact same content
# <tgt> has snapshot s2
# s2 is newer than s1
# <tgt>'s current state is the same as s2
# <src> may or may not have snapshot s2
|
|Steps|# calculate snapshotDiff<s1,s2> at <src>
# apply rename/delete part of snapshotDiff on <tgt>
# copy modified part of snapshotDiff from s2 of <src> to <tgt>
|# calculate snapshotDiff<s2,s1> at <tgt>
# apply rename/delete part of snapshotDiff on <tgt>
# copy modified part of snapshotDiff from s1 of <src> to <tgt>
|

> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
> Key: HDFS-9820
> URL: https://issues.apache.org/jira/browse/HDFS-9820
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: distcp
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch,
> HDFS-9820.003.patch, HDFS-9820.004.patch, HDFS-9820.005.patch
>
>
> A common use scenario (scenario 1):
> # create snapshot sx in clusterX,
> # do some experiments in clusterX, which create some files,
> # throw away the changed files and go back to sx.
> Another scenario (scenario 2): there is a production cluster and a backup
> cluster, and we periodically sync data from the production cluster to the
> backup cluster with distcp.
> The cluster in scenario 1 could be the backup cluster in scenario 2.
> For scenario 1:
> HDFS-4167 intends to restore HDFS to the most recent snapshot, and it
> involves some complexity and challenges. Before that jira is implemented, we
> count on distcp to copy from the snapshot to the current state. However, the
> performance of this operation could be very bad because we have to go through
> all files even if only a few files changed.
> For scenario 2:
> HDFS-7535 improved distcp performance by avoiding copying files whose names
> changed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data
> from the source to the target cluster, by copying only the files changed
> since the last backup. It works by using snapshot diff to find all changed
> files and copying only those.
> See
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the
> target cluster since the last snapshot sx, and copy these from snapshot sx of
> either the source or the target cluster, to restore the target cluster's
> current state to sx.
> Specifically, if a file/dir is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, put it on the copy list
> Then run distcp with the copy list, copying from the source cluster's
> corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> As a native restore feature, HDFS-4167 would still be ideal to have. However,
> HDFS-9820 would hopefully be easier to implement, before HDFS-4167 is in
> place.
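
The restore rules in the description above (rename back, delete what was created, copy what was modified) can be sketched as a small planning step. This is only an illustration in Python, not the distcp implementation: `DiffEntry`, `RestorePlan`, and `plan_restore` are hypothetical names, and the entry kinds "R", "+", and "M" are assumed to follow the labels used in HDFS snapshot diff reports. The description does not say how files deleted since sx should be handled, so the sketch rejects such entries rather than guessing.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DiffEntry:
    """One change on the target since snapshot sx (hypothetical type)."""
    kind: str                       # "R" renamed, "+" created, "M" modified
    path: str                       # path as it existed in sx (or the new path for "+")
    new_path: Optional[str] = None  # set only for renames: the path after the rename


@dataclass
class RestorePlan:
    """Operations needed to bring the target back to sx (hypothetical type)."""
    renames: List[Tuple[str, str]] = field(default_factory=list)  # (current, original)
    deletes: List[str] = field(default_factory=list)
    copy_list: List[str] = field(default_factory=list)


def plan_restore(changes_since_sx: List[DiffEntry]) -> RestorePlan:
    """Map each change since sx to the operation that undoes it."""
    plan = RestorePlan()
    for entry in changes_since_sx:
        if entry.kind == "R":
            # Renamed after sx: rename it back to its original path.
            plan.renames.append((entry.new_path, entry.path))
        elif entry.kind == "+":
            # Created after sx: delete it.
            plan.deletes.append(entry.path)
        elif entry.kind == "M":
            # Modified after sx: copy it again from snapshot sx.
            plan.copy_list.append(entry.path)
        else:
            # Not covered by the description (for example deletions).
            raise ValueError(f"unhandled diff entry kind: {entry.kind!r}")
    return plan


if __name__ == "__main__":
    changes = [
        DiffEntry("R", "/data/a", "/data/b"),
        DiffEntry("+", "/data/tmp"),
        DiffEntry("M", "/data/c"),
    ]
    print(plan_restore(changes))
```

Only the paths in `copy_list` would then be handed to distcp, which is what avoids walking every file when few files changed.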