[jira] [Updated] (MAPREDUCE-2765) DistCp Rewrite

Mithun Radhakrishnan (Updated) (JIRA) Wed, 18 Jan 2012 18:57:11 -0800

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mithun Radhakrishnan updated MAPREDUCE-2765:
--------------------------------------------

    Release Note: DistCpV2 added to hadoop-tools.
          Status: Patch Available  (was: Open)
    
> DistCp Rewrite
> --------------
>
>                 Key: MAPREDUCE-2765
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2765
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: distcp, mrv2
>    Affects Versions: 0.23.0, 0.20.203.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>             Fix For: 0.23.1
>
>         Attachments: 2765_hadoop-branch-0.23.patch, 2765_trunk.patch, 
> distcpv2.20.203.patch, distcpv2_hadoop-0.23.1.patch, 
> distcpv2_hadoop-0.23.1.patch, distcpv2_hadoop-trunk.patch, 
> distcpv2_patch_0.23.1-SNAPSHOT_tucu_reviewed.patch, 
> distcpv2_patch_hadoop-trunk_tucu_reviewed.patch, distcpv2_trunk.patch, 
> distcpv2_trunk_post_review_1.patch
>
>
> This is a slightly modified version of the DistCp rewrite that Yahoo uses in 
> production today. The rewrite was ground-up, with specific focus on:
> 1. improved startup time (postponing as much work as possible to the MR job)
> 2. support for multiple copy-strategies
> 3. new features (e.g. -atomic, -async, -bandwidth.)
> 4. improved programmatic use
> Some effort has gone into refactoring what used to be achieved by a single 
> large (1.7 KLOC) source file, into a design that (hopefully) reads better too.
> The proposed DistCpV2 preserves command-line-compatibility with the old 
> version, and should be a drop-in replacement.
> New to v2:
> 1. Copy-strategies and the DynamicInputFormat:
>       A copy-strategy determines the policy by which source-file-paths are 
> distributed between map-tasks. (These boil down to the choice of the 
> input-format.) 
>       If no strategy is explicitly specified on the command-line, the policy 
> chosen is "uniform size", where v2 behaves identically to old-DistCp. (The 
> number of bytes transferred by each map-task is roughly equal, at a per-file 
> granularity.) 
>       Alternatively, v2 ships with a "dynamic" copy-strategy (in the 
> DynamicInputFormat). This policy acknowledges that 
>               (a)  dividing files based only on file-size might not be an 
> even distribution (E.g. if some datanodes are slower than others, or if some 
> files are skipped.)
>               (b) a "static" association of a source-path to a map increases 
> the likelihood of long-tails during copy.
>       The "dynamic" strategy divides the list-of-source-paths into a number 
> (> nMaps) of smaller parts. When each map completes its current list of 
> paths, it picks up a new list to process, if available. So if a map-task is 
> stuck on a slow (and not necessarily large) file, other maps can pick up the 
> slack. The thinner the file-list is sliced, the greater the parallelism (and 
> the lower the chances of long-tails). Within reason, of course: the number of 
> these short-lived list-files is capped at an overridable maximum.
>       Internal benchmarks against source/target clusters with some slow(ish) 
> datanodes have indicated significant performance gains when using the 
> dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
>       Please note that the DynamicInputFormat might prove useful outside of 
> DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also 
> note that the copy-strategies have no bearing on the CopyMapper.map() 
> implementation.
>       
> 2. Improved startup-time and programmatic use:
>       When the old-DistCp runs with -update, and creates the 
> list-of-source-paths, it attempts to filter out files that might be skipped 
> (by comparing file-sizes, checksums, etc.) This significantly increases the 
> startup time (or the time spent in serial processing till the MR job is 
> launched), blocking the calling-thread. This becomes pronounced as nFiles 
> increases. (Internal benchmarks have seen situations where more time is spent 
> setting up the job than on the actual transfer.)
>       DistCpV2 postpones as much work as possible to the MR job. The 
> file-listing isn't filtered until the map-task runs (at which time, identical 
> files are skipped). DistCpV2 can now be run "asynchronously". The program 
> quits at job-launch, logging the job-id for tracking. Programmatically, the 
> DistCp.execute() returns a Job instance for progress-tracking.
>       
> 3. New features:
>       (a)   -async: As described in #2.
>       (b)   -atomic: Data is copied to a (user-specifiable) tmp-location, and 
> then moved atomically to destination.
>       (c)   -bandwidth: Enforces a limit on the bandwidth consumed per map.
>       (d)   -strategy: As above.    
>       
> A more comprehensive description the newer features, how the dynamic-strategy 
> works, etc. is available in src/site/xdoc/, and in the pdf that's generated 
> therefrom, during the build.
> High on the list of things to do is support to parallelize copies on a 
> per-block level. (i.e. Incorporation of HDFS-222.)
> I look forward to comments, suggestions and discussion that will hopefully 
> ensue. I have this running against Hadoop 0.20.203.0. I also have a port to 
> 0.23.0 (complete with unit-tests).
> P.S.
> A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for 
> ideas, code, reviews and guidance. Although much of the code is mine, the 
> idea to use the DFS to implement "dynamic" input-splits wasn't.
>       

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-2765) DistCp Rewrite

Reply via email to