Jing Zhao commented on HDFS-10314:

So the current patch actually adds a new distsync extension and implements the 
"calculating diff on target cluster" approach? I think to have the diff 
calculated on target is fine, but I'm not sure to directly extend the current 
distcp is a good idea.

Correct me if I'm wrong. Here's my current understanding of the patch:
1. our main motivation is still to utilize distcp to restore a snapshot
2. the idea is to compute the delta on the target cluster, and for modified 
files we get their original state from the source. 

In that sense, I think a simpler way is to wrap (but not extend) the current 
distcp in the snapshot-restore tool:
1. The tool takes a single cluster and a target snapshot as arguments
2. The tool computes the delta for restoring using snapshot diff report
3. The tool does rename/delete etc. metadata ops to revert part of the diff
4. The tool uses the distcp (by invokes distcp as a library) to copy the 
original states of modified files

In this way we can minimize the change (no need to touch the current distcp 
implementation/arguments), and provides a new tool with simple but clear 
semantic. We may lose some flexibility (only handling one cluster) but the tool 
itself will be easy to use and will not cause any confusion to the end users.

What do you think? Please let me know if I miss anything.

> A new tool to sync current HDFS view to specified snapshot
> ----------------------------------------------------------
>                 Key: HDFS-10314
>                 URL: https://issues.apache.org/jira/browse/HDFS-10314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-10314.001.patch
> HDFS-9820 proposed adding -rdiff switch to distcp, as a reversed operation of 
> -diff switch. 
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps 
> around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to unix/linux 
> command "rsync". The "r" here means remote.
> The syntax that simulate -rdiff behavior proposed in HDFS-9820 is
> {code}
> rsync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> This command ensure <fromSnapshotName>  is newer than <toSnapshotName>.
> I think, In the future, we can add another command to have the functionality 
> of -diff switch of distcp.
> {code}
> sync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> that ensures <fromSnapshotName>  is older than <toSnapshotName>.
> Thanks [~jingzhao].

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to