[ 
https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662875#comment-14662875
 ] 

Yongjun Zhang commented on HDFS-8828:
-------------------------------------

HI [~yufeigu],

Thanks for addressing my above comments and verbal one about adding assertion 
in the testing code in the new rev. I did one more round of review, and have 
couple of nits and an improvement comments:

# move the following two statements to next to each other.
{code}
    path = getPathWithSchemeAndAuthority(path);
    path = makeQualified(path);
{code}
# change
{code}
   for (Map.Entry<Text, CopyListingFileStatus> entry : copyListing.entrySet())
   {
{code}
back to 
{code}
    for (Map.Entry<Text, CopyListingFileStatus> entry : copyListing.entrySet()) 
{
{code}
I know that this means 81 chars one line, I guess we favor the coding guideline 
of not to have "{" at a new line. This will possibly solve the checkstyle warn 
reported by jenkins (I can't see it now because the link is not accessible 
right now, please look into anyways).
# In TestDistCpSync.java, suggest to modify all {{changeData<XYZ>}} method to 
return an integer of number of created/modified items,
as below:
{code}
int changeData<xyz>() {
  int numCreatedModified = 0;
  ...
  // after creating/modifying a file
  numCreatedModified += 1;
  ...
  return numCreatedModified;
}
{code}
And replace the hardcoded number in the added assertion in each test with the 
returned value.
This way, it's easier to maintain the test code.
 
Hi [~jingzhao], it largely looks good to me after addressing the above 
comments. Would you please help taking a look from your perspective?

Many thanks.




> Utilize Snapshot diff report to build copy list in distcp
> ---------------------------------------------------------
>
>                 Key: HDFS-8828
>                 URL: https://issues.apache.org/jira/browse/HDFS-8828
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Yufei Gu
>            Assignee: Yufei Gu
>         Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch, 
> HDFS-8828.003.patch, HDFS-8828.004.patch
>
>
> Some users reported huge time cost to build file copy list in distcp. (30 
> hours for 1.6M files). We can leverage snapshot diff report to build file 
> copy list including files/dirs which are changes only between two snapshots 
> (or a snapshot and a normal dir). It speed up the process in two folds: 1. 
> less copy list building time. 2. less file copy MR jobs.
> HDFS snapshot diff report provide information about file/directory creation, 
> deletion, rename and modification between two snapshots or a snapshot and a 
> normal directory. HDFS-7535 synchronize deletion and rename, then fallback to 
> the default distcp. So it still relies on default distcp to building complete 
> list of files under the source dir. This patch only puts creation and 
> modification files into the copy list based on snapshot diff report. We can 
> minimize the number of files to copy. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to