[
https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yongjun Zhang updated HDFS-9820:
--------------------------------
Summary: Improve distcp to support efficient restore to an earlier snapshot
(was: Improve distcp to support efficient restore)
> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
> Key: HDFS-9820
> URL: https://issues.apache.org/jira/browse/HDFS-9820
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: distcp
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
>
> HDFS-4167 aims to restore HDFS to the most recent snapshot, but that approach
> involves some complexity and challenges.
> HDFS-7535 improved distcp performance by avoiding copying files that were
> only renamed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data
> from the source to the target cluster, by copying only the files changed since
> the last backup. It works by using a snapshot diff to find all changed files
> and copying only those.
> See
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the
> target cluster since the last snapshot sx, and copy them from the same
> snapshot sx on the source cluster, to restore the target cluster to sx.
> If a file/dir is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, put it on the copy list
> Then run distcp with the copy list, copying from the source cluster's
> corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
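
The steps above can be sketched as a small planning function. This is only an illustration, not the distcp implementation: the names RestorePlan and plan_restore and the (kind, path, new_path) tuple layout are hypothetical, loosely modeled on the "R", "+", "M" and "-" markers of a snapshot diff report. Deleted entries are not covered by the description, so the sketch assumes they also go on the copy list. It also ignores interactions such as a file that was both renamed and modified.

```python
from dataclasses import dataclass, field


@dataclass
class RestorePlan:
    # (current_path, original_path) pairs: rename current back to original
    renames: list = field(default_factory=list)
    # paths created on the target since sx, to be deleted
    deletes: list = field(default_factory=list)
    # paths to copy from the source cluster's snapshot sx
    copies: list = field(default_factory=list)


def plan_restore(diff_entries):
    """Classify target-side changes since snapshot sx into restore actions.

    diff_entries is an iterable of (kind, path, new_path) tuples, where
    new_path is only meaningful for renames. This layout is hypothetical.
    """
    plan = RestorePlan()
    for kind, path, new_path in diff_entries:
        if kind == "R":
            # renamed: rename it back
            plan.renames.append((new_path, path))
        elif kind == "+":
            # created in target cluster: delete it
            plan.deletes.append(path)
        elif kind == "M":
            # modified: put it on the copy list
            plan.copies.append(path)
        elif kind == "-":
            # ASSUMPTION (not in the description): a deleted entry is
            # restored by copying it from the source snapshot as well.
            plan.copies.append(path)
        else:
            raise ValueError(f"unknown diff entry kind: {kind!r}")
    return plan


if __name__ == "__main__":
    example = [
        ("R", "a/old", "a/new"),
        ("+", "b/tmp", None),
        ("M", "c/data", None),
    ]
    print(plan_restore(example))
```

The copy list produced this way would then be handed to distcp, reading from the source cluster's snapshot sx as described above.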
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)