[ https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241365#comment-15241365 ]
Yongjun Zhang commented on HDFS-9820:
-------------------------------------

Hi [~jingzhao],

Thanks a lot for your review and comments! Here are my replies, in the same order as your questions; I hope they make sense to you:

# Without the HDFS-10263 fix, I always use the forward snapshot diff internally and do the transformation from there. I'm not sure whether your first question suggests that we still use the reversed diff (which doesn't have the HDFS-10263 fix) and translate its result to be symmetric with the forward snapshot diff (the same as what HDFS-10263 would have achieved). If so, because the result would still need another (existing) transformation, as we currently do, that would cause the complexity I referred to in HDFS-10263.
# We now use {{-diff "" <ss>}} on the command line to get the same behavior as {{-rdiff <ss>}} in the last patch rev. Since HDFS-10263 is not available, I swapped the source and target internally (and added the {{useRdiff}} flag to indicate the swapping), and always use the forward snapshot diff.
# It seems you mean we should allow the user to pass snapshot names in any order, either {{-diff s1 s2}} or {{-diff s2 s1}}, and let the program order s1 and s2? What I was thinking was that we need the order the user passed to indicate whether we are doing a forward diff (HDFS-8828) or a reverse diff (HDFS-9820). Thus {{-diff s1 s2}} and {{-diff s2 s1}} mean different things to me. I may have misunderstood you, though.

In addition, after HDFS-10263 is in place, we can make the implementation more symmetric (HDFS-8828 vs HDFS-9820).

Thanks much.

> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch,
>                      HDFS-9820.003.patch, HDFS-9820.004.patch
>
>
> HDFS-4167 intends to restore HDFS to the most recent snapshot, and it involves
> some complexity and challenges.
> HDFS-7535 improved distcp performance by avoiding copying files whose names
> changed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data
> from the source to the target cluster, by copying only the files changed since
> the last backup. It works by using the snapshot diff to find all changed files
> and copying only those.
> See
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the
> target cluster since the last snapshot sx, and copy these from the source
> cluster's same snapshot sx, to restore the target cluster to sx.
> If a file/dir is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, put it to the copy list
> Then run distcp with the copy list, copying from the source cluster's
> corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> HDFS-4167 would still be nice to have. It just seems to me that HDFS-9820
> would hopefully be easier to implement.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
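
The restore steps listed in the issue description (rename back, delete, add to the copy list) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the attached patches: the names {{DiffEntry}}, {{RestorePlan}} and {{plan_restore}}, and the entry labels "renamed", "created" and "modified", are hypothetical, and the sketch covers only the three cases the description names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DiffEntry:
    """One change in the target cluster since snapshot sx (hypothetical type)."""
    kind: str                       # "renamed", "created" or "modified" (assumed labels)
    path: str                       # path as it was in snapshot sx (or the new path if created)
    new_path: Optional[str] = None  # current path; only set for renames


@dataclass
class RestorePlan:
    """Actions needed to bring the target back to snapshot sx (hypothetical type)."""
    renames: List[Tuple[str, str]] = field(default_factory=list)  # (current path, original path)
    deletes: List[str] = field(default_factory=list)
    copy_list: List[str] = field(default_factory=list)


def plan_restore(diff: List[DiffEntry]) -> RestorePlan:
    """Turn a list of changes into the three kinds of actions from the description."""
    plan = RestorePlan()
    for entry in diff:
        if entry.kind == "renamed":
            # Renamed since sx: rename it back to its original path.
            plan.renames.append((entry.new_path, entry.path))
        elif entry.kind == "created":
            # Created in the target cluster since sx: delete it.
            plan.deletes.append(entry.path)
        elif entry.kind == "modified":
            # Modified since sx: copy it again from the source cluster's snapshot.
            plan.copy_list.append(entry.path)
        else:
            # Cases the description does not list are left out of this sketch.
            raise ValueError(f"unhandled change kind: {entry.kind!r}")
    return plan


if __name__ == "__main__":
    changes = [
        DiffEntry("renamed", "/data/a", "/data/b"),
        DiffEntry("created", "/data/new"),
        DiffEntry("modified", "/data/f"),
    ]
    print(plan_restore(changes))
```

The resulting copy list would then be what distcp is run with, copying from the source cluster's corresponding snapshot, as the description says.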