[
https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661976#comment-14661976
]
Yongjun Zhang commented on HDFS-8828:
-------------------------------------
Thanks [~yufeigu] for the new rev and [~jingzhao] for looking into.
I did more review of rev 3, below are my comments:
# {{static DiffInfo[] getDiffsForListBuilding(SnapshotDiffReport report)}
## Please add one line in javadoc to state that DELETE is handled in
{{DistCpSync#sync}}
## Add a comment for {{if(entry.getSourcePath().length <= 0)}}, what it means
when the source path is <= 0.
# {{isParentOf(Path parent, Path child) }}
## it need to check whether the matching point is a path delimiter; in
addition,
## {{if (childPath.equals(parentPath))}} can be optimized with checking length
instead of content, once the {{startWith}} matches.
# In {{getRenameItem}}}}:
## The comment in the body: suggest to change
// The same path string may appear in:
// 1. both renamed and modified snapshot diff entries,
// 2. both renamed and created snapshot diff entries
// Case 1 is the about same file/directory, whereas case 2 is about two
different files/directories.
// We are finding case 1 here, thus we check against DiffType.MODIFY.
## Add a space in {{}else if}}
## Add a comment in the {{else if}} branch saying that this item can be either
modified or created.
# In {{static Path getTargetPath(Path sourcePath, DiffInfo renameItem)}}
## change "by the rename item" to "based on the rename item"
## {{StringBuffer sb = new StringBuffer(sourcePath.toString());}} to after the
first return statement.
# In {{getDiffList(DistCpOptions options)}},
## The two calls to {{finalListWithTarget.add(diff);}} in
{{getDiffList(DistCpOptions options)}} can be consolidated.
## The check of {{if (diff.target != null)}} meant to check whether the entry
is not rename, can we change it to check the type instead of target to be more
clear?
# Need some improvement in javadoc for {{public void
doBuildListingWithSnapshotDiff}}. Suggest to replace "We need to handle rename
item here since some created/modified items could be renamed as well." with "An
item can be created/modified and renamed, in which case, the target path is put
into the list".
I did not review the test code yet, hope you can revisit and can catch
something yourself. I will review after this round.
Thanks.
> Utilize Snapshot diff report to build copy list in distcp
> ---------------------------------------------------------
>
> Key: HDFS-8828
> URL: https://issues.apache.org/jira/browse/HDFS-8828
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp, snapshots
> Reporter: Yufei Gu
> Assignee: Yufei Gu
> Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch,
> HDFS-8828.003.patch
>
>
> Some users reported huge time cost to build file copy list in distcp. (30
> hours for 1.6M files). We can leverage snapshot diff report to build file
> copy list including files/dirs which are changes only between two snapshots
> (or a snapshot and a normal dir). It speed up the process in two folds: 1.
> less copy list building time. 2. less file copy MR jobs.
> HDFS snapshot diff report provide information about file/directory creation,
> deletion, rename and modification between two snapshots or a snapshot and a
> normal directory. HDFS-7535 synchronize deletion and rename, then fallback to
> the default distcp. So it still relies on default distcp to building complete
> list of files under the source dir. This patch only puts creation and
> modification files into the copy list based on snapshot diff report. We can
> minimize the number of files to copy.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)