[
https://issues.apache.org/jira/browse/MAPREDUCE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mithun Radhakrishnan updated MAPREDUCE-2765:
--------------------------------------------
Release Note: DistCpV2 added to hadoop-tools.
Status: Patch Available (was: Open)
> DistCp Rewrite
> --------------
>
> Key: MAPREDUCE-2765
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2765
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: distcp, mrv2
> Affects Versions: 0.23.0, 0.20.203.0
> Reporter: Mithun Radhakrishnan
> Assignee: Mithun Radhakrishnan
> Fix For: 0.23.1
>
> Attachments: 2765_hadoop-branch-0.23.patch, 2765_trunk.patch,
> distcpv2.20.203.patch, distcpv2_hadoop-0.23.1.patch,
> distcpv2_hadoop-0.23.1.patch, distcpv2_hadoop-trunk.patch,
> distcpv2_patch_0.23.1-SNAPSHOT_tucu_reviewed.patch,
> distcpv2_patch_hadoop-trunk_tucu_reviewed.patch, distcpv2_trunk.patch,
> distcpv2_trunk_post_review_1.patch
>
>
> This is a slightly modified version of the DistCp rewrite that Yahoo uses in
> production today. The rewrite was ground-up, with specific focus on:
> 1. improved startup time (postponing as much work as possible to the MR job)
> 2. support for multiple copy-strategies
> 3. new features (e.g. -atomic, -async, -bandwidth.)
> 4. improved programmatic use
> Some effort has gone into refactoring what used to be achieved by a single
> large (1.7 KLOC) source file, into a design that (hopefully) reads better too.
> The proposed DistCpV2 preserves command-line-compatibility with the old
> version, and should be a drop-in replacement.
> New to v2:
> 1. Copy-strategies and the DynamicInputFormat:
> A copy-strategy determines the policy by which source-file-paths are
> distributed between map-tasks. (These boil down to the choice of the
> input-format.)
> If no strategy is explicitly specified on the command-line, the policy
> chosen is "uniform size", where v2 behaves identically to old-DistCp. (The
> number of bytes transferred by each map-task is roughly equal, at a per-file
> granularity.)
> Alternatively, v2 ships with a "dynamic" copy-strategy (in the
> DynamicInputFormat). This policy acknowledges that
> (a) dividing files based only on file-size might not be an
> even distribution (E.g. if some datanodes are slower than others, or if some
> files are skipped.)
> (b) a "static" association of a source-path to a map increases
> the likelihood of long-tails during copy.
> The "dynamic" strategy divides the list-of-source-paths into a number
> (> nMaps) of smaller parts. When each map completes its current list of
> paths, it picks up a new list to process, if available. So if a map-task is
> stuck on a slow (and not necessarily large) file, other maps can pick up the
> slack. The thinner the file-list is sliced, the greater the parallelism (and
> the lower the chances of long-tails). Within reason, of course: the number of
> these short-lived list-files is capped at an overridable maximum.
> Internal benchmarks against source/target clusters with some slow(ish)
> datanodes have indicated significant performance gains when using the
> dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
> Please note that the DynamicInputFormat might prove useful outside of
> DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also
> note that the copy-strategies have no bearing on the CopyMapper.map()
> implementation.
>
> 2. Improved startup-time and programmatic use:
> When the old-DistCp runs with -update, and creates the
> list-of-source-paths, it attempts to filter out files that might be skipped
> (by comparing file-sizes, checksums, etc.) This significantly increases the
> startup time (or the time spent in serial processing till the MR job is
> launched), blocking the calling-thread. This becomes pronounced as nFiles
> increases. (Internal benchmarks have seen situations where more time is spent
> setting up the job than on the actual transfer.)
> DistCpV2 postpones as much work as possible to the MR job. The
> file-listing isn't filtered until the map-task runs (at which time, identical
> files are skipped). DistCpV2 can now be run "asynchronously". The program
> quits at job-launch, logging the job-id for tracking. Programmatically, the
> DistCp.execute() returns a Job instance for progress-tracking.
>
> 3. New features:
> (a) -async: As described in #2.
> (b) -atomic: Data is copied to a (user-specifiable) tmp-location, and
> then moved atomically to destination.
> (c) -bandwidth: Enforces a limit on the bandwidth consumed per map.
> (d) -strategy: As above.
>
> A more comprehensive description the newer features, how the dynamic-strategy
> works, etc. is available in src/site/xdoc/, and in the pdf that's generated
> therefrom, during the build.
> High on the list of things to do is support to parallelize copies on a
> per-block level. (i.e. Incorporation of HDFS-222.)
> I look forward to comments, suggestions and discussion that will hopefully
> ensue. I have this running against Hadoop 0.20.203.0. I also have a port to
> 0.23.0 (complete with unit-tests).
> P.S.
> A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for
> ideas, code, reviews and guidance. Although much of the code is mine, the
> idea to use the DFS to implement "dynamic" input-splits wasn't.
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira