[
https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703591#comment-14703591
]
Yongjun Zhang commented on HDFS-8828:
-------------------------------------
Hi [~jingzhao],
Thanks a lot for your review and comments. I discussed them with [~yufeigu], and he
worked out the new revs to address them.
Hi [~yufeigu], thanks for the new rev. Some nits:
* Put the following code into its own method, e.g.
{{createInputFileListingWithDiff}}:
{code}
Path fileListingPath = getFileListingPath();
CopyListing copyListing =
    new SimpleCopyListing(job.getConfiguration(),
        job.getCredentials(), distCpSync);
copyListing.buildListing(fileListingPath, inputOptions);
{code}
so that it parallels the existing method
{{createInputFileListing(Job job)}}.
* You accidentally changed the line
{{* http://www.apache.org/licenses/LICENSE-2.0}}; please revert this change.
* In comments, "//xyz" should be "// xyz"; note the space between "//" and
the text.
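For the first nit, a rough sketch of what the extracted method might look like is below. It is only an illustration: the {{Path}} return type and the {{distCpSync}} parameter are assumptions (in the patch, {{distCpSync}} and {{inputOptions}} may well be available from the enclosing class), and {{Path}}, {{Job}}, {{CopyListing}} and {{SimpleCopyListing}} are replaced by minimal stand-ins so that the snippet compiles on its own.

```java
// Stand-in types (hypothetical); the real ones are the Hadoop classes of the same name.
final class Path {
    private final String location;
    Path(String location) { this.location = location; }
    @Override public String toString() { return location; }
}

final class Job {
    Object getConfiguration() { return new Object(); }
    Object getCredentials() { return new Object(); }
}

abstract class CopyListing {
    abstract void buildListing(Path listingPath, Object options);
}

final class SimpleCopyListing extends CopyListing {
    SimpleCopyListing(Object conf, Object credentials, Object distCpSync) { }

    @Override
    void buildListing(Path listingPath, Object options) {
        // The real class writes the copy list to listingPath; the stand-in does nothing.
    }
}

final class DistCpSketch {
    // Stand-in for the options object used in the quoted snippet.
    private final Object inputOptions = new Object();

    // Stand-in for the existing helper; the path returned here is made up.
    Path getFileListingPath() {
        return new Path("/tmp/fileList.seq");
    }

    // The extracted method: the body is the snippet quoted above, unchanged.
    // Returning the listing path is an assumption, not something the patch confirms.
    Path createInputFileListingWithDiff(Job job, Object distCpSync) {
        Path fileListingPath = getFileListingPath();
        CopyListing copyListing =
            new SimpleCopyListing(job.getConfiguration(),
                job.getCredentials(), distCpSync);
        copyListing.buildListing(fileListingPath, inputOptions);
        return fileListingPath;
    }
}
```

The call site would then shrink to a single call such as {{createInputFileListingWithDiff(job, distCpSync)}}, sitting next to the existing {{createInputFileListing(job)}}.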
Please consider addressing them together with any comments Jing might have.
Hi [~jingzhao], it looks good to me once the above nits are addressed. Would you
mind taking another look, so that Yufei can address everything together if you
have any more comments?
Thanks,
> Utilize Snapshot diff report to build copy list in distcp
> ---------------------------------------------------------
>
> Key: HDFS-8828
> URL: https://issues.apache.org/jira/browse/HDFS-8828
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: distcp, snapshots
> Reporter: Yufei Gu
> Assignee: Yufei Gu
> Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch,
> HDFS-8828.003.patch, HDFS-8828.004.patch, HDFS-8828.005.patch,
> HDFS-8828.006.patch, HDFS-8828.007.patch, HDFS-8828.008.patch,
> HDFS-8828.009.patch, HDFS-8828.010.patch
>
>
> Some users reported that building the file copy list in distcp takes a very
> long time (30 hours for 1.6M files). We can leverage the snapshot diff report
> to build a file copy list that includes only the files/dirs changed between
> two snapshots (or a snapshot and a normal dir). This speeds up the process in
> two ways: 1. less copy list building time. 2. fewer file copy MR jobs.
> The HDFS snapshot diff report provides information about file/directory
> creation, deletion, rename and modification between two snapshots or a
> snapshot and a normal directory. HDFS-7535 synchronizes deletions and renames,
> then falls back to the default distcp, so it still relies on the default
> distcp to build the complete list of files under the source dir. This patch
> puts only created and modified files into the copy list, based on the snapshot
> diff report, which minimizes the number of files to copy.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)