[
https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698985#comment-14698985
]
Jing Zhao commented on HDFS-8828:
---------------------------------
Thanks again for working on this, Yufei. The patch looks good to me overall.
Some minor comments:
# DistCpOptions may not be a good place to put SnapshotDiffReport. The diff
report is an intermediate result of the distcp, while DistCpOptions is only
used for holding all the options. Let's find another way to pass the diff
report to the list building function.
# It's better to combine {{getDiffs}} and {{getDiffsForListBuilding}} so that
we do not need to scan the diff report twice. Maybe we can let {{getDiffs}}
return an EnumMap<DiffType, List<DiffInfo>>?
# We can remove the empty @param descriptions from the javadoc.
# DiffInfo#type can be declared as final.
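
A minimal sketch of the single-pass grouping suggested in the second comment. The DiffType and DiffInfo types below are simplified stand-ins, not the actual classes in the patch, and groupByType is a hypothetical name for the combined method:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;

// Illustration only: stand-in types, not the classes from the patch.
final class DiffGrouping {
  // Assumed to mirror the kinds of changes a snapshot diff report lists.
  enum DiffType { CREATE, MODIFY, DELETE, RENAME }

  static final class DiffInfo {
    final String source;
    final String target;  // null unless the entry is a rename
    final DiffType type;  // declared final, as suggested in the last comment

    DiffInfo(String source, String target, DiffType type) {
      this.source = source;
      this.target = target;
      this.type = type;
    }
  }

  // Scans the report entries once and buckets them by type, so callers that
  // need renames/deletions and callers that need creations/modifications can
  // share the same result instead of each scanning the report.
  static EnumMap<DiffType, List<DiffInfo>> groupByType(List<DiffInfo> entries) {
    EnumMap<DiffType, List<DiffInfo>> grouped = new EnumMap<>(DiffType.class);
    for (DiffType t : DiffType.values()) {
      grouped.put(t, new ArrayList<>());  // every type maps to a list, possibly empty
    }
    for (DiffInfo entry : entries) {
      grouped.get(entry.type).add(entry);
    }
    return grouped;
  }

  public static void main(String[] args) {
    List<DiffInfo> report = Arrays.asList(
        new DiffInfo("dir/a", null, DiffType.CREATE),
        new DiffInfo("dir/b", null, DiffType.MODIFY),
        new DiffInfo("dir/c", "dir/d", DiffType.RENAME),
        new DiffInfo("dir/e", null, DiffType.CREATE));

    EnumMap<DiffType, List<DiffInfo>> grouped = groupByType(report);
    for (DiffType t : DiffType.values()) {
      System.out.println(t + ": " + grouped.get(t).size());
    }
  }
}
```

With this shape, the sync step could read the RENAME and DELETE lists while the list building step reads CREATE and MODIFY, all from one scan of the report.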
> Utilize Snapshot diff report to build copy list in distcp
> ---------------------------------------------------------
>
> Key: HDFS-8828
> URL: https://issues.apache.org/jira/browse/HDFS-8828
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp, snapshots
> Reporter: Yufei Gu
> Assignee: Yufei Gu
> Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch,
> HDFS-8828.003.patch, HDFS-8828.004.patch, HDFS-8828.005.patch,
> HDFS-8828.006.patch, HDFS-8828.007.patch
>
>
> Some users reported that building the file copy list in distcp takes a very
> long time (30 hours for 1.6M files). We can leverage the snapshot diff report
> to build a copy list that includes only the files/dirs that changed between
> two snapshots (or between a snapshot and a normal dir). This speeds up the
> process in two ways: 1. less time spent building the copy list; 2. fewer
> file copy MR jobs.
> The HDFS snapshot diff report provides information about file/directory
> creation, deletion, rename and modification between two snapshots or between
> a snapshot and a normal directory. HDFS-7535 synchronizes deletions and
> renames, then falls back to the default distcp, so it still relies on the
> default distcp to build the complete list of files under the source dir.
> This patch puts only created and modified files into the copy list, based on
> the snapshot diff report, which minimizes the number of files to copy.