[ 
https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661976#comment-14661976
 ] 

Yongjun Zhang commented on HDFS-8828:
-------------------------------------

Thanks [~yufeigu] for the new rev and [~jingzhao] for looking into.

I did more review of rev 3, below are my comments:

# {{static DiffInfo[] getDiffsForListBuilding(SnapshotDiffReport report)}
## Please add one line in javadoc to state that DELETE is handled in 
{{DistCpSync#sync}}
## Add a comment for {{if(entry.getSourcePath().length <= 0)}}, what it means 
when the source path is <= 0.
# {{isParentOf(Path parent, Path child) }} 
## it need to check whether the matching point is a path delimiter; in 
addition, 
## {{if (childPath.equals(parentPath))}} can be optimized with checking length 
instead of content, once the {{startWith}} matches.
# In {{getRenameItem}}}}:
## The comment in the body: suggest to change
      // The same path string may appear in:
      //   1. both renamed and modified snapshot diff entries,
      //   2. both renamed and created snapshot diff entries
      // Case 1 is the about same file/directory, whereas case 2 is about two 
different files/directories. 
      // We are finding case 1 here, thus we check against DiffType.MODIFY.
## Add a space in {{}else if}}
## Add a comment in the {{else if}} branch saying that this item can be either 
modified or created.
# In {{static Path getTargetPath(Path sourcePath, DiffInfo renameItem)}}
## change "by the rename item" to "based on the rename item" 
## {{StringBuffer sb = new StringBuffer(sourcePath.toString());}} to after the 
first return statement.
# In {{getDiffList(DistCpOptions options)}},
## The two calls to {{finalListWithTarget.add(diff);}} in 
{{getDiffList(DistCpOptions options)}} can be consolidated.
## The check of {{if (diff.target != null)}} meant to check whether the entry 
is not rename, can we change it to check the type instead of target to be more 
clear?
# Need some improvement in javadoc for {{public void 
doBuildListingWithSnapshotDiff}}. Suggest to replace "We need to handle rename 
item here since some created/modified items could be renamed as well." with "An 
item can be created/modified and renamed, in which case, the target path is put 
into the list".

I did not review the test code yet, hope you can revisit and can catch 
something yourself. I will review after this round.

Thanks.


> Utilize Snapshot diff report to build copy list in distcp
> ---------------------------------------------------------
>
>                 Key: HDFS-8828
>                 URL: https://issues.apache.org/jira/browse/HDFS-8828
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Yufei Gu
>            Assignee: Yufei Gu
>         Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch, 
> HDFS-8828.003.patch
>
>
> Some users reported huge time cost to build file copy list in distcp. (30 
> hours for 1.6M files). We can leverage snapshot diff report to build file 
> copy list including files/dirs which are changes only between two snapshots 
> (or a snapshot and a normal dir). It speed up the process in two folds: 1. 
> less copy list building time. 2. less file copy MR jobs.
> HDFS snapshot diff report provide information about file/directory creation, 
> deletion, rename and modification between two snapshots or a snapshot and a 
> normal directory. HDFS-7535 synchronize deletion and rename, then fallback to 
> the default distcp. So it still relies on default distcp to building complete 
> list of files under the source dir. This patch only puts creation and 
> modification files into the copy list based on snapshot diff report. We can 
> minimize the number of files to copy. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to