[ 
https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231139#comment-15231139
 ] 

Yongjun Zhang commented on HDFS-9820:
-------------------------------------

Hi [~jingzhao],

About your other comments:

{quote}
2. Currently rdiff is a standalone option for distcp. This means we're using 
distcp to do the restore. To restore a directory back to a snapshot, this may 
not be the most efficient way compared with a local restoring solution 
(HDFS-4167), which can avoid most of the unnecessary data copying and can 
provide a copy-on-write semantic when supporting restoring appended/truncated 
files.
{quote}
I agree that HDFS-4167 is more efficient because it's native. However, it 
involves considerable complexity. Until HDFS-4167 is available, I'm hoping 
HDFS-9820 can be a simpler and reasonably fast solution, especially given the 
groundwork in HDFS-7535 and HDFS-8828. For appended/truncated data, 
MAPREDUCE-6572 would help once implemented. The idea there is to remember the 
truncated length of the file and copy only the changed data.

{quote}
3. But before we finish the work in HDFS-4167, maybe we can augment the current 
diff-based distcp by allowing the admin to choose to restore the target back to 
the latest snapshot. We can still use the implementation in the current patch, 
but instead of adding a new rdiff option for distcp, we add a "--force" option 
to the current diff-based distcp. What do you think?
{quote}

In my opinion, the functions of {{-diff}} and {{-rdiff}} are quite different: 
the former copies the changes from the source cluster to the target; the latter 
reverts changes made in the target (or any) cluster back to a snapshot point. 
So I personally think that, from the user's perspective, two different command 
options are clearer and thus less error-prone. In addition, {{-diff}} requires 
two snapshot names as parameters, whereas {{-rdiff}} needs just one.

Do you agree?

Thanks a lot.


> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch
>
>
> HDFS-4167 intends to restore HDFS to the most recent snapshot, but it involves 
> some complexity and challenges. 
> HDFS-7535 improved distcp performance by avoiding copying files whose names 
> changed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data 
> from the source to the target cluster, by copying only the files changed since 
> the last backup. It works by using the snapshot diff to find all changed files 
> and copying only those.
> See 
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the 
> target cluster since the last snapshot sx, and copy them from the source 
> cluster's same snapshot sx, to restore the target cluster to sx.
> If a file/dir is
> - renamed, rename it back
> - created in the target cluster, delete it
> - modified, put it on the copy list
> Then run distcp with the copy list, copying from the source cluster's 
> corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> HDFS-4167 would still be nice to have. It just seems to me that HDFS-9820 
> would hopefully be easier to implement.
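
To make the restore steps in the description above concrete, here is a minimal sketch of how a snapshot diff of the target could be turned into a restore plan. It is only an illustration, not the code in the attached patches: the names DiffEntry, RestorePlan and plan_restore are hypothetical, and sending deleted entries to the copy list is my assumption, since the description does not cover deletions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DiffEntry:
    """One change on the target since snapshot sx (hypothetical shape)."""
    kind: str                       # "RENAME", "CREATE", "MODIFY" or "DELETE"
    path: str                       # path the entry refers to (old path for RENAME)
    new_path: Optional[str] = None  # current path, set only for RENAME


@dataclass
class RestorePlan:
    renames: List[Tuple[str, str]]  # (current path, path to rename back to)
    deletes: List[str]              # paths created after sx, to be removed
    copy_list: List[str]            # paths to copy from the source's snapshot sx


def plan_restore(diff: List[DiffEntry]) -> RestorePlan:
    """Classify each diff entry into the action needed to get back to sx."""
    plan = RestorePlan(renames=[], deletes=[], copy_list=[])
    for entry in diff:
        if entry.kind == "RENAME":
            # Renamed on the target: rename it back, no data copy needed.
            plan.renames.append((entry.new_path, entry.path))
        elif entry.kind == "CREATE":
            # Did not exist in sx: delete it.
            plan.deletes.append(entry.path)
        elif entry.kind in ("MODIFY", "DELETE"):
            # Content differs from sx: copy it from the source snapshot.
            # Treating DELETE this way is an assumption of this sketch.
            plan.copy_list.append(entry.path)
        else:
            raise ValueError("unknown diff entry kind: " + entry.kind)
    return plan
```

Only the paths in copy_list would then be handed to distcp; the renames and deletes are metadata operations on the target, which is where the savings over a full copy would come from.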



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
