Yufei Gu created HDFS-8828:
------------------------------
Summary: Utilize Snapshot diff report to build copy list in distcp
Key: HDFS-8828
URL: https://issues.apache.org/jira/browse/HDFS-8828
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Yufei Gu
Assignee: Yufei Gu
Some users have reported that building the file copy list in distcp takes a
very long time (30 hours with 1.6M files). We can leverage the snapshot diff
report to build a file copy list that includes only the files/dirs that changed
between two snapshots (or between a snapshot and a normal dir). This speeds up
the process in two ways: 1. less copy list building time. 2. fewer file copy MR
jobs.
The HDFS snapshot diff report provides information about file/directory
creation, deletion, rename and modification between two snapshots or between a
snapshot and a normal directory. HDFS-7535 synchronizes deletions and renames,
then falls back to the default distcp. So it still relies on the default distcp
to build the copy list, which traverses all files under the source dir. This
patch will build the copy list based on the snapshot diff report instead.
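
To illustrate the idea only (this is not the patch itself), here is a minimal
Python sketch of deriving a copy list from diff entries instead of walking the
whole source tree. The names DiffEntry and build_copy_list are hypothetical,
and the one-letter kinds ("+", "-", "M", "R") are assumed to mirror the
markers printed in a snapshot diff report. It assumes deletions and renames
are synchronized separately, as HDFS-7535 does, so only created and modified
paths need copying. A real implementation would also have to handle cases this
sketch ignores, such as expanding a newly created directory into its children.

```python
from typing import List, NamedTuple, Optional


class DiffEntry(NamedTuple):
    """Hypothetical stand-in for one line of a snapshot diff report."""
    kind: str                     # "+" created, "-" deleted, "M" modified, "R" renamed
    path: str                     # path relative to the snapshot root
    target: Optional[str] = None  # rename destination, only used when kind == "R"


def build_copy_list(entries: List[DiffEntry]) -> List[str]:
    """Return the paths that need copying, without traversing the source dir.

    Assumption: deletions ("-") and renames ("R") are applied on the target
    by a separate sync step, so they contribute nothing to the copy list.
    """
    to_copy = set()
    for entry in entries:
        if entry.kind in ("+", "M"):
            to_copy.add(entry.path)
    # Sort for a deterministic order; parent dirs sort before their children.
    return sorted(to_copy)


if __name__ == "__main__":
    report = [
        DiffEntry("M", "."),
        DiffEntry("+", "./new.txt"),
        DiffEntry("-", "./old.txt"),
        DiffEntry("R", "./a", "./b"),
        DiffEntry("M", "./data/part-0"),
    ]
    for path in build_copy_list(report):
        print(path)
```

The cost of this approach scales with the number of changed entries rather
than with the total number of files under the source dir, which is where the
savings described above would come from.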
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)