Yongjun Zhang commented on HDFS-10314:
Many thanks [~jingzhao] for the review and feedback. Please see my answers
below.
So the current patch actually adds a new distsync extension and implements the
"calculating diff on target cluster" approach?
Yes. This is the result of our discussion in HDFS-9820.
Though I preferred adding -rdiff as the symmetric counterpart of -diff in
distcp, as reported in HDFS-9820, I think your suggestion of creating a new
tool is fine, as long as we leverage the code that does -diff in distcp, and
minimize code duplication.
I think to have the diff calculated on target is fine,
Yes. Since the goal is to make the target's state go to a specified snapshot,
we'd better calculate snapshot diff at the target.
but I'm not sure directly extending the current distcp is a good idea.
There are a couple of reasons why I came up with the idea of extending distcp:
* distsync is a customized distcp, it extends distcp's -diff behavior to
* it's better to re-use the code that implements -diff; extending allows
re-using the existing implementation of "-diff". You can see it's only 124
lines of code (including the header and imports) in DistSync.java in my patch.
Correct me if I'm wrong. Here's my current understanding of the patch:
1. our main motivation is still to utilize distcp to restore a snapshot
2. the idea is to compute the delta on the target cluster, and for modified
files we get their original state from the source.
Yes. However, for modified files, I intended to make it flexible to copy from
the specified snapshot of either the source or the target.
In that sense, I think a simpler way is to wrap (but not extend) the current
distcp in the snapshot-restore tool:
1. The tool takes a single cluster and a target snapshot as arguments
2. The tool computes the delta for restoring using snapshot diff report
3. The tool does rename/delete etc. metadata ops to revert part of the diff
4. The tool uses distcp (by invoking distcp as a library) to copy the
original states of modified files
In this way we can minimize the change (no need to touch the current distcp
implementation/arguments), and provide a new tool with simple but clear
semantics. We may lose some flexibility (only handling one cluster) but the tool
itself will be easy to use and will not cause any confusion to the end users.
What do you think? Please let me know if I missed anything.
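As a rough illustration of steps 2 to 4 above, here is a minimal Python sketch
of how a snapshot diff (from the target snapshot to the current state) could be
turned into a restore plan. It is not code from the patch or from distcp. The
names DiffEntry, RestorePlan and plan_restore are hypothetical, the entry kinds
("+", "-", "M", "R") only mirror the markers used by the HDFS snapshot diff
report, and the sketch assumes every path is a file and ignores ordering
conflicts between renames and deletes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DiffEntry:
    """One line of a snapshot diff report (hypothetical, simplified model)."""
    kind: str                      # "+" created, "-" deleted, "M" modified, "R" renamed
    path: str                      # path in the snapshot (or the new path for "+")
    target: Optional[str] = None   # new path, only set for "R" entries


@dataclass
class RestorePlan:
    """Operations needed to bring the current state back to the snapshot."""
    renames: List[Tuple[str, str]] = field(default_factory=list)  # (current, original)
    deletes: List[str] = field(default_factory=list)
    copies: List[str] = field(default_factory=list)   # to be handed to a copy job


def plan_restore(diff: List[DiffEntry]) -> RestorePlan:
    """Invert a snapshot-to-current diff into a restore plan.

    Renames and creations are reverted with metadata operations only;
    modified and deleted files need their original content copied back.
    """
    plan = RestorePlan()
    for entry in diff:
        if entry.kind == "R":
            if entry.target is None:
                raise ValueError(f"rename entry without target: {entry.path}")
            # Undo the rename: move the current path back to the original one.
            plan.renames.append((entry.target, entry.path))
        elif entry.kind == "+":
            # Created after the snapshot, so it must go away.
            plan.deletes.append(entry.path)
        elif entry.kind in ("M", "-"):
            # Original content only exists in the snapshot: copy it back.
            plan.copies.append(entry.path)
        else:
            raise ValueError(f"unknown diff entry kind: {entry.kind!r}")
    return plan


if __name__ == "__main__":
    report = [
        DiffEntry("R", "/d/a", "/d/b"),
        DiffEntry("+", "/d/new"),
        DiffEntry("M", "/d/f"),
        DiffEntry("-", "/d/gone"),
    ]
    print(plan_restore(report))
```

In this model, the copies list is the part that would be passed to distcp,
whether the original content is read from the target's own snapshot or from a
mirror cluster.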
We discussed two overall solutions earlier.
* Solution A. What was proposed in HDFS-9820: adding "-rdiff s2 s1" to distcp,
to achieve behavior symmetric to "-diff s1 s2" of distcp.
* Solution B. What was proposed in HDFS-10314: introducing a new tool that
allows syncing a target cluster to a specified snapshot.
For Solution B, there are two approaches, one (B.1) is my patch rev001 here,
the other (B.2) is what you proposed above.
# Creating a new tool is itself going to mean extra support; that's why I
preferred solution #A, which is the simplest.
# Given that we want to create a new tool, we'd better maximize code sharing,
otherwise, it's going to be both more development effort and extra support
# To me, the way suggested by solution #B.2 prevents sharing the existing
implementation of -diff in distcp. Thus I think it's actually not simpler, and
it would incur a support burden in the future because of the duplicated code.
# I think we agreed in our discussion that if we create a new tool, you
don't have a strong opinion on whether we copy from a different cluster or from
the same target cluster. As I shared earlier, I can tell from the user's case
that copying from a different mirror cluster can sometimes be much faster. So I
keep suggesting that it would be better to support the flexibility to copy from
either the source or the target.
Would you please kindly share the specific problems you see with solution #B.1?
Honestly speaking, I still prefer solution #A. But I'm ok with solution B,
except that I really hope to share the -diff code already implemented in distcp.
Thanks a lot.
> A new tool to sync current HDFS view to specified snapshot
> Key: HDFS-10314
> URL: https://issues.apache.org/jira/browse/HDFS-10314
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: tools
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Attachments: HDFS-10314.001.patch
> HDFS-9820 proposed adding -rdiff switch to distcp, as a reversed operation of
> -diff switch.
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps
> around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to the unix/linux
> command "rsync". The "r" here means remote.
> The syntax that simulates the -rdiff behavior proposed in HDFS-9820 is
> rsync <fromSnapshotName> <toSnapshotName> <source> <target>
> This command ensures <fromSnapshotName> is newer than <toSnapshotName>.
> I think, in the future, we can add another command to have the functionality
> of the -diff switch of distcp.
> sync <fromSnapshotName> <toSnapshotName> <source> <target>
> that ensures <fromSnapshotName> is older than <toSnapshotName>.
> Thanks [~jingzhao].