Yongjun Zhang commented on HDFS-10314:

Many thanks [~jingzhao] for the review and feedback. Please see my answers below.

So the current patch actually adds a new distsync extension and implements the 
"calculating diff on target cluster" approach? 
Yes. This is the result of our discussion in HDFS-9820. 

Though I preferred adding -rdiff as the symmetric counterpart of how -diff works
in distcp, as reported in HDFS-9820, I think your suggestion of creating a new 
tool is fine, as long as we leverage the code that does -diff in distcp and 
minimize code duplication.

I think to have the diff calculated on target is fine, 
Yes. Since the goal is to bring the target's state back to a specified snapshot,
we'd better calculate the snapshot diff at the target.

but I'm not sure to directly extend the current distcp is a good idea.
There are a couple of reasons why I came up with the idea of extending distcp:
* distsync is a customized distcp: it extends distcp's -diff behavior to 
support -rdiff.
* It's better to reuse the code that implements -diff, and extending allows 
reusing the existing implementation of "-diff". You can see it's only 124 
lines of code (including the header and imports) in DistSync.java in my patch.

Correct me if I'm wrong. Here's my current understanding of the patch:
1. our main motivation is still to utilize distcp to restore a snapshot
2. the idea is to compute the delta on the target cluster, and for modified 
files we get their original state from the source.
Yes. However, for modified files, I intended to make it flexible to copy from 
the specified snapshot of either the source or the target.

In that sense, I think a simpler way is to wrap (but not extend) the current 
distcp in the snapshot-restore tool:
1. The tool takes a single cluster and a target snapshot as arguments
2. The tool computes the delta for restoring using snapshot diff report
3. The tool does rename/delete etc. metadata ops to revert part of the diff
4. The tool uses distcp (by invoking distcp as a library) to copy the 
original states of modified files
In this way we can minimize the change (no need to touch the current distcp 
implementation/arguments), and provide a new tool with simple but clear 
semantics. We may lose some flexibility (only handling one cluster) but the tool 
itself will be easy to use and will not cause any confusion to the end users.
What do you think? Please let me know if I miss anything.
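
As a rough illustration of steps 2 to 4 in the proposal quoted above, the 
sketch below turns a snapshot diff into a list of metadata operations plus a 
list of paths to hand to distcp. It is an in-memory model only: RevertPlan, 
Entry, DiffType and the operation strings are hypothetical names made up for 
this example, not classes from the patch or from Hadoop. It also assumes that 
deleted files are restored by copying, and it ignores the ordering conflicts 
that a real tool would have to handle (for example, a rename onto a path that 
was created later).

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical in-memory model of a snapshot-restore plan.
 * None of these names come from the patch or from the Hadoop API.
 */
final class RevertPlan {
    /** Kinds of change between the snapshot and the current state. */
    enum DiffType { CREATE, DELETE, MODIFY, RENAME }

    /** One diff entry, read in the direction "snapshot -> current state". */
    static final class Entry {
        final DiffType type;
        final String path;      // affected path (the old path for RENAME)
        final String renamedTo; // new path for RENAME, otherwise null

        Entry(DiffType type, String path, String renamedTo) {
            this.type = type;
            this.path = path;
            this.renamedTo = renamedTo;
        }
    }

    /** Step 3: metadata operations that undo creates and renames. */
    final List<String> metadataOps = new ArrayList<>();
    /** Step 4: paths whose snapshot state must be copied back. */
    final List<String> copyList = new ArrayList<>();

    /** Step 2: derive the revert plan from the diff entries. */
    static RevertPlan from(List<Entry> diff) {
        RevertPlan plan = new RevertPlan();
        for (Entry e : diff) {
            switch (e.type) {
                case CREATE:
                    // Created after the snapshot: remove it.
                    plan.metadataOps.add("delete " + e.path);
                    break;
                case RENAME:
                    // Renamed after the snapshot: move it back.
                    plan.metadataOps.add("rename " + e.renamedTo + " " + e.path);
                    break;
                case DELETE:
                case MODIFY:
                    // Content is only available in the snapshot: copy it.
                    plan.copyList.add(e.path);
                    break;
            }
        }
        return plan;
    }
}
```

In this model the only difference between B.1 and B.2 is where the paths in 
copyList are read from: the same snapshot on the target cluster, or the 
matching snapshot on a separate source cluster.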

We discussed two overall solutions earlier.

* Solution A. What was proposed in HDFS-9820: adding "-rdiff s2 s1" to distcp, 
to achieve behavior symmetric to "-diff s1 s2" of distcp.
* Solution B. What was proposed in HDFS-10314: introducing a new tool that 
allows syncing a target cluster to a specified snapshot.

For Solution B, there are two approaches: one (B.1) is my patch rev001 here; 
the other (B.2) is what you proposed above.

Some thoughts:

# Creating a new tool is itself going to mean extra support. That's why I 
preferred solution #A, which is the simplest.
# Given that we want to create a new tool, we'd better maximize code sharing; 
otherwise, it's going to mean both more development effort and extra support.
# To me, the way suggested by solution #B.2 disallows sharing the existing 
implementation of -diff in distcp. Thus I think it's actually not simpler, and 
it would incur a support burden in the future because of the duplicated code.
# I think we agreed in our discussion that if we create a new tool, you 
don't have a strong opinion on whether we copy from a different cluster or from 
the same target cluster. As I shared earlier, I can tell from the user's case 
that copying from a different mirror cluster can sometimes be much faster. So I 
keep suggesting that it would be better to support the flexibility to copy from 
either the source or the target.

Would you please kindly share the specific problems you see with solution #B.1? 

Honestly speaking, I still prefer solution #A. But I'm ok with solution B, 
except that I really hope to share the -diff code already implemented in distcp.

Thanks a lot.

> A new tool to sync current HDFS view to specified snapshot
> ----------------------------------------------------------
>                 Key: HDFS-10314
>                 URL: https://issues.apache.org/jira/browse/HDFS-10314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-10314.001.patch
> HDFS-9820 proposed adding -rdiff switch to distcp, as a reversed operation of 
> -diff switch. 
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps 
> around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to unix/linux 
> command "rsync". The "r" here means remote.
> The syntax that simulates the -rdiff behavior proposed in HDFS-9820 is
> {code}
> rsync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> This command ensures <fromSnapshotName>  is newer than <toSnapshotName>.
> I think that, in the future, we can add another command to provide the 
> functionality of the -diff switch of distcp.
> {code}
> sync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> that ensures <fromSnapshotName>  is older than <toSnapshotName>.
> Thanks [~jingzhao].
