[
https://issues.apache.org/jira/browse/HDFS-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512098#comment-15512098
]
Yongjun Zhang commented on HDFS-10314:
--------------------------------------
Hi [~jingzhao],
Thanks again for your earlier feedback.
I'd like to share the details below about why I don't think your proposed
method is simpler. I hope this makes sense to you; please correct me if I'm
wrong, and please elaborate here to help me understand better.
DistCp does two basic steps:
# based on the input, create the copyListing, which is a sequence file for
the mapreduce job; each entry contains the info needed to identify one
<source, target> pair, plus the file attribute info
# hand the sequence file to the mapreduce job
Step 2 is relatively stable these days; mostly we are changing step 1 based
on the input.
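To make step 1 concrete, here is a minimal sketch of what a copyListing entry
carries. It is not DistCp code: CopyEntry, build_copy_listing, and the example
URIs are hypothetical names for illustration, and a plain Python list stands
in for the sequence file.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CopyEntry:
    """One copyListing record (hypothetical shape, not DistCp's own class)."""
    source: str      # full path of the file to read
    target: str      # full path of the file to write
    length: int      # file attribute info carried along with the pair
    permission: str


def build_copy_listing(files: List[Tuple[str, int, str]],
                       source_root: str,
                       target_root: str) -> List[CopyEntry]:
    """Step 1: turn (relative path, length, permission) into <source, target> pairs."""
    listing = []
    for rel_path, length, permission in files:
        rel = rel_path.lstrip("/")
        listing.append(CopyEntry(
            source=f"{source_root.rstrip('/')}/{rel}",
            target=f"{target_root.rstrip('/')}/{rel}",
            length=length,
            permission=permission,
        ))
    return listing


if __name__ == "__main__":
    # Step 2 would hand these entries to the copy job; here we only print them.
    for entry in build_copy_listing([("a/b.txt", 10, "rw-r--r--")],
                                    "hdfs://nn1/src", "hdfs://nn2/dst"):
        print(entry)
```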
"-diff s1 s2" replaced the original step 1 with a new step 1:
* 1.1 compute snapshot diff,
* 1.2 figure out the rename/delete operation's source and target, based on the
snapshot diff info
* 1.3 apply the rename/delete to the target path
* 1.4 figure out the add/modification operation's source and target, based on
the snapshot diff info
* 1.5 create copyListing based on step 1.4
*The tricky parts* are 1.2 and 1.4, and the order of applying the rename/delete
operations in step 1.3. With HDFS-7535 and HDFS-8828, a framework has been
implemented in DistCp that does the new step 1. What I did was to re-use the
framework.
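As a rough illustration of steps 1.2 through 1.5, the sketch below splits a
snapshot diff into operations to apply on the target and a copy listing. It is
not the HDFS-7535/HDFS-8828 framework: DiffEntry, plan_sync, and the
"deletes before renames" ordering are assumptions made for this example, and
it ignores the hard cases (for instance a file modified under a renamed
directory), which is exactly where the real logic gets tricky.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DiffEntry:
    """One snapshot diff record (hypothetical shape for this sketch)."""
    kind: str                       # "CREATE", "MODIFY", "DELETE" or "RENAME"
    path: str                       # path relative to the snapshot root
    new_path: Optional[str] = None  # only set for "RENAME"


def _join(root: str, rel: str) -> str:
    return root.rstrip("/") + "/" + rel.lstrip("/")


def plan_sync(diff: List[DiffEntry], source_root: str, target_root: str
              ) -> Tuple[List[tuple], List[Tuple[str, str]]]:
    """Return (target operations for step 1.3, copy listing for step 1.5)."""
    deletes, renames, copy_listing = [], [], []
    for entry in diff:
        if entry.kind == "DELETE":      # step 1.2
            deletes.append(("delete", _join(target_root, entry.path)))
        elif entry.kind == "RENAME":    # step 1.2
            renames.append(("rename",
                            _join(target_root, entry.path),
                            _join(target_root, entry.new_path)))
        elif entry.kind in ("CREATE", "MODIFY"):  # step 1.4
            copy_listing.append((_join(source_root, entry.path),
                                 _join(target_root, entry.path)))
        else:
            raise ValueError(f"unknown diff kind: {entry.kind}")
    # Assumed ordering for this sketch only: all deletes, then all renames.
    return deletes + renames, copy_listing


if __name__ == "__main__":
    diff = [DiffEntry("MODIFY", "a.txt"), DiffEntry("RENAME", "old", "new"),
            DiffEntry("DELETE", "tmp"), DiffEntry("CREATE", "b.txt")]
    ops, listing = plan_sync(diff, "/src/.snapshot/s2", "/dst")
    print(ops)
    print(listing)
```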
Now the questions:
# With what you proposed, I don't see how the tricky parts I listed above are
simplified. You also suggested not touching the existing DistCp implementation;
I took that to mean rewriting the code that does the tricky parts, which is not
simpler.
# Which step in your proposal will generate the copyListing? Step 3 or step 4?
** If it's in step 3, how are we going to pass the result to distcp in step 4?
** If it's in step 4, that means we need to calculate the snapshot diff
again in step 4 and redo the tricky manipulation there. That doesn't look
simpler, and it probably requires additional access to the NN.
# Would you please share the specific problems you see with my implementation,
beyond your view that your proposal would be simpler? I really hope you can do
that.
Thanks much.
> A new tool to sync current HDFS view to specified snapshot
> ----------------------------------------------------------
>
> Key: HDFS-10314
> URL: https://issues.apache.org/jira/browse/HDFS-10314
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: tools
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Attachments: HDFS-10314.001.patch
>
>
> HDFS-9820 proposed adding -rdiff switch to distcp, as a reversed operation of
> -diff switch.
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps
> around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to unix/linux
> command "rsync". The "r" here means remote.
> The syntax that simulates the -rdiff behavior proposed in HDFS-9820 is
> {code}
> rsync <fromSnapshotName> <toSnapshotName> <source> <target>
> {code}
> This command ensures <fromSnapshotName> is newer than <toSnapshotName>.
> In the future, I think we can add another command that provides the
> functionality of the -diff switch of distcp.
> {code}
> sync <fromSnapshotName> <toSnapshotName> <source> <target>
> {code}
> that ensures <fromSnapshotName> is older than <toSnapshotName>.
> Thanks [~jingzhao].