Yongjun Zhang created HDFS-9820:
-----------------------------------
Summary: Improve distcp to support efficient restore
Key: HDFS-9820
URL: https://issues.apache.org/jira/browse/HDFS-9820
Project: Hadoop HDFS
Issue Type: New Feature
Components: distcp
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
HDFS-4167 intends to restore HDFS to the most recent snapshot, but it involves
some complexity and challenges.
HDFS-7535 improved distcp performance by avoiding copying files that were only
renamed since the last backup.
On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data
from the source to the target cluster, by copying only the files changed since
the last backup. It works by using the snapshot diff to find all changed files
and copying only those.
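To illustrate the idea (this is a minimal sketch, not the distcp implementation), the snippet below picks the paths to copy out of a snapshot diff report. It assumes report lines of the form "<type> <path>", with "+" for created, "M" for modified, "-" for deleted and "R" for renamed, similar to what "hdfs snapshotDiff" prints. The function name changed_paths and the sample paths are made up for illustration.

```python
def changed_paths(report_lines):
    """Return the paths marked created (+) or modified (M) in a diff report.

    Assumption: each entry is "<type><whitespace><path>". Lines that do not
    match this shape, or that carry another type, are skipped.
    """
    to_copy = []
    for line in report_lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue
        kind, path = parts
        if kind in ("+", "M"):
            to_copy.append(path.strip())
    return to_copy


# Hypothetical report entries, for illustration only.
report = [
    "M\t./logs",
    "+\t./logs/new.txt",
    "-\t./tmp/old.txt",
    "R\t./a.txt -> ./b.txt",
]
print(changed_paths(report))  # ['./logs', './logs/new.txt']
```

Renamed and deleted entries are left out of the copy list here because, per HDFS-7535, they do not require copying file data.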
See
https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
This jira proposes a variation of HDFS-8828: find the files changed in the
target cluster since the last snapshot sx, and copy them from the same
snapshot sx on the source cluster, to restore the target cluster to sx.
If a file/dir is
- renamed, rename it back
- created in target cluster, delete it
- modified, put it on the copy list
Then run distcp with the copy list, copying from the source cluster's
corresponding snapshot.
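The steps above can be sketched as a small planning function. This is only an illustration under stated assumptions, not the proposed distcp code: the diff entries are hypothetical in-memory tuples such as ("R", old_path, new_path), ("+", path) and ("M", path), and the names RestorePlan and plan_restore are invented here. Entry types not covered by the list above (for example deletions) are simply collected as unhandled.

```python
from dataclasses import dataclass, field


@dataclass
class RestorePlan:
    renames: list = field(default_factory=list)    # (current_path, original_path)
    deletes: list = field(default_factory=list)    # paths created after sx
    copies: list = field(default_factory=list)     # paths to copy from source sx
    unhandled: list = field(default_factory=list)  # entries this sketch ignores


def plan_restore(diff_entries):
    """Map target-side changes since snapshot sx to restore actions."""
    plan = RestorePlan()
    for entry in diff_entries:
        kind = entry[0]
        if kind == "R":
            # Renamed on the target: rename it back to the original path.
            _, old_path, new_path = entry
            plan.renames.append((new_path, old_path))
        elif kind == "+":
            # Created on the target after sx: delete it.
            plan.deletes.append(entry[1])
        elif kind == "M":
            # Modified on the target: copy it again from the source's sx.
            plan.copies.append(entry[1])
        else:
            plan.unhandled.append(entry)
    return plan


# Hypothetical entries, for illustration only.
entries = [
    ("R", "/data/a", "/data/b"),
    ("+", "/data/tmp"),
    ("M", "/data/c"),
    ("-", "/data/d"),
]
plan = plan_restore(entries)
print(plan.renames)  # [('/data/b', '/data/a')]
print(plan.deletes)  # ['/data/tmp']
print(plan.copies)   # ['/data/c']
```

The copy list would then be handed to distcp, reading from the source cluster's snapshot sx.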
This could be a new command line switch -rdiff in distcp.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)