[ 
https://issues.apache.org/jira/browse/HDFS-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512098#comment-15512098
 ] 

Yongjun Zhang commented on HDFS-10314:
--------------------------------------

Hi [~jingzhao],

Thanks again for your earlier feedback. 

I'd like to share the details below on why I don't think your proposed 
method is simpler. I hope this makes sense; please correct me if I'm wrong, 
and please elaborate so that I can understand your proposal better.

DistCp does two basic steps:
# Based on the input, create the copyListing, a sequence file for the 
MapReduce job in which each entry contains the information needed to identify 
one <source, target> pair plus the file attributes.
# Hand the sequence file to the MapReduce job.

Step 2 is relatively stable these days; most changes manipulate step 1 based 
on the input.

"-diff s1 s2" replaced the original step 1 with a new step 1:
* 1.1 compute the snapshot diff
* 1.2 figure out the source and target of each rename/delete operation, based 
on the snapshot diff info
* 1.3 apply the renames/deletes to the target path
* 1.4 figure out the source and target of each add/modify operation, based on 
the snapshot diff info
* 1.5 create the copyListing based on step 1.4

*The tricky parts* are steps 1.2 and 1.4, and the order in which the 
rename/delete operations are applied in step 1.3. With HDFS-7535 and 
HDFS-8828, a framework that performs the new step 1 was implemented in DistCp. 
What I did was reuse that framework.
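
As a rough illustration of steps 1.2 through 1.5, a naive split of a snapshot 
diff into target-side operations and a copy listing could look like the sketch 
below. This is not the actual DistCp code: the class, types, and method names 
are hypothetical, and the diff entries are built by hand rather than obtained 
from HDFS. It deliberately ignores the hard cases, such as a file that is both 
renamed and modified, or renames and deletes whose order matters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; none of these names exist in DistCp.
final class DiffSyncSketch {
    // Simplified stand-in for the kinds of entries in a snapshot diff.
    enum DiffType { CREATE, MODIFY, DELETE, RENAME }

    static final class DiffEntry {
        final DiffType type;
        final String path;       // path relative to the snapshot root
        final String renamedTo;  // new relative path for RENAME, otherwise null

        DiffEntry(DiffType type, String path, String renamedTo) {
            this.type = type;
            this.path = path;
            this.renamedTo = renamedTo;
        }
    }

    static final class Plan {
        // Steps 1.2/1.3: operations to apply on the target before copying.
        final List<String> targetOps = new ArrayList<>();
        // Steps 1.4/1.5: "<source> -> <target>" pairs for the copy listing.
        final List<String> copyListing = new ArrayList<>();
    }

    // Naive split: renames and deletes go to targetOps in diff order,
    // creates and modifications go to the copy listing.
    static Plan plan(List<DiffEntry> diff, String source, String target) {
        Plan plan = new Plan();
        for (DiffEntry e : diff) {
            switch (e.type) {
                case RENAME:
                    plan.targetOps.add("mv " + target + "/" + e.path
                            + " " + target + "/" + e.renamedTo);
                    break;
                case DELETE:
                    plan.targetOps.add("rm " + target + "/" + e.path);
                    break;
                default: // CREATE or MODIFY
                    plan.copyListing.add(source + "/" + e.path
                            + " -> " + target + "/" + e.path);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        List<DiffEntry> diff = Arrays.asList(
                new DiffEntry(DiffType.RENAME, "a", "b"),
                new DiffEntry(DiffType.DELETE, "c", null),
                new DiffEntry(DiffType.CREATE, "d", null),
                new DiffEntry(DiffType.MODIFY, "e", null));
        Plan plan = plan(diff, "/src", "/dst");
        plan.targetOps.forEach(System.out::println);
        plan.copyListing.forEach(System.out::println);
    }
}
```

In this naive form the split looks trivial. The cases the sketch skips are 
what make steps 1.2 through 1.4 tricky.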

Now the questions:

# With what you proposed, I don't see how the tricky parts listed above become 
simpler. You also suggested not touching the existing DistCp implementation; I 
took that to mean rewriting the code that does the tricky parts, which is not 
simpler.
# Which step in your proposal generates the copyListing: step 3 or step 4? 
** If it's step 3, how are we going to pass the result to distcp in step 4?
** If it's step 4, we would need to compute the snapshot diff again in step 4 
and redo the tricky manipulation there. That doesn't look simpler, and it 
probably means additional NameNode (NN) access.
# Would you please share the specific problems you see with my implementation, 
beyond your view that your proposal would be simpler? I would really 
appreciate that.

Thanks much.

> A new tool to sync current HDFS view to specified snapshot
> ----------------------------------------------------------
>
>                 Key: HDFS-10314
>                 URL: https://issues.apache.org/jira/browse/HDFS-10314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-10314.001.patch
>
>
> HDFS-9820 proposed adding -rdiff switch to distcp, as a reversed operation of 
> -diff switch. 
> Upon discussion with [~jingzhao], we will introduce a new tool that wraps 
> around distcp to achieve the same purpose.
> I'm thinking about calling the new tool "rsync", similar to unix/linux 
> command "rsync". The "r" here means remote.
> The syntax that simulates the -rdiff behavior proposed in HDFS-9820 is
> {code}
> rsync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> This command ensures <fromSnapshotName> is newer than <toSnapshotName>.
> I think that, in the future, we can add another command that provides the 
> functionality of the -diff switch of distcp.
> {code}
> sync <fromSnapshotName>  <toSnapshotName>  <source> <target>
> {code}
> that ensures <fromSnapshotName>  is older than <toSnapshotName>.
> Thanks [~jingzhao].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
