[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated MAPREDUCE-2765:
--------------------------------------------

    Description: 
This is a slightly modified version of the DistCp rewrite that Yahoo uses in 
production today. The rewrite was ground-up, with specific focus on:
1. improved startup time (postponing as much work as possible to the MR job)
2. support for multiple copy-strategies
3. new features (e.g. -atomic, -async, -bandwidth.)
4. improved programmatic use
Some effort has gone into refactoring what used to be achieved by a single 
large (1.7 KLOC) source file, into a design that (hopefully) reads better too.

The proposed DistCpV2 preserves command-line-compatibility with the old 
version, and should be a drop-in replacement.

New to v2:

1. Copy-strategies and the DynamicInputFormat:
        A copy-strategy determines the policy by which source-file-paths are 
distributed between map-tasks. (These boil down to the choice of the 
input-format.) 
        If no strategy is explicitly specified on the command-line, the policy 
chosen is "uniform size", where v2 behaves identically to old-DistCp. (The 
number of bytes transferred by each map-task is roughly equal, at a per-file 
granularity.) 
        Alternatively, v2 ships with a "dynamic" copy-strategy (in the 
DynamicInputFormat). This policy acknowledges that 
                (a)  dividing files based only on file-size might not be an 
even distribution (E.g. if some datanodes are slower than others, or if some 
files are skipped.)
                (b) a "static" association of a source-path to a map increases 
the likelihood of long-tails during copy.
        The "dynamic" strategy divides the list-of-source-paths into a number 
(> nMaps) of smaller parts. When each map completes its current list of paths, 
it picks up a new list to process, if available. So if a map-task is stuck on a 
slow (and not necessarily large) file, other maps can pick up the slack. The 
thinner the file-list is sliced, the greater the parallelism (and the lower the 
chances of long-tails). Within reason, of course: the number of these 
short-lived list-files is capped at an overridable maximum.
        Internal benchmarks against source/target clusters with some slow(ish) 
datanodes have indicated significant performance gains when using the 
dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
        Please note that the DynamicInputFormat might prove useful outside of 
DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also 
note that the copy-strategies have no bearing on the CopyMapper.map() 
implementation.
        
2. Improved startup-time and programmatic use:
        When the old-DistCp runs with -update, and creates the 
list-of-source-paths, it attempts to filter out files that might be skipped (by 
comparing file-sizes, checksums, etc.) This significantly increases the startup 
time (or the time spent in serial processing till the MR job is launched), 
blocking the calling-thread. This becomes pronounced as nFiles increases. 
(Internal benchmarks have seen situations where more time is spent setting up 
the job than on the actual transfer.)
        DistCpV2 postpones as much work as possible to the MR job. The 
file-listing isn't filtered until the map-task runs (at which time, identical 
files are skipped). DistCpV2 can now be run "asynchronously". The program quits 
at job-launch, logging the job-id for tracking. Programmatically, the 
DistCp.execute() returns a Job instance for progress-tracking.
        
3. New features:
        (a)   -async: As described in #2.
        (b)   -atomic: Data is copied to a (user-specifiable) tmp-location, and 
then moved atomically to destination.
        (c)   -bandwidth: Enforces a limit on the bandwidth consumed per map.
        (d)   -strategy: As above.    
        
A more comprehensive description the newer features, how the dynamic-strategy 
works, etc. is available in src/site/xdoc/, and in the pdf that's generated 
therefrom, during the build.

High on the list of things to do is support to parallelize copies on a 
per-block level. (i.e. Incorporation of HDFS-222.)

I look forward to comments, suggestions and discussion that will hopefully 
ensue. I have this running against Hadoop 0.20.203.0. I also have a port to 
0.23.0 (complete with unit-tests).

P.S.
A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for 
ideas, code, reviews and guidance. Although much of the code is mine, the idea 
to use the DFS to implement "dynamic" input-splits wasn't.
        


  was:
This is a slightly modified version of the DistCp rewrite that Yahoo(!) uses in 
production today. The rewrite was ground-up, with specific focus on:
1. improved startup time (postponing as much work as possible to the MR job)
2. support for multiple copy-strategies
3. new features (e.g. -atomic, -async, -bandwidth.)
4. improved programmatic use
Some effort has gone into refactoring what used to be achieved by a single 
large (1.7 KLOC) source file, into a design that (hopefully) reads better too.

The proposed DistCpV2 preserves command-line-compatibility with the old 
version, and should be a drop-in replacement.

New to v2:

1. Copy-strategies and the DynamicInputFormat:
        A copy-strategy determines the policy by which source-file-paths are 
distributed between map-tasks. (These boil down to the choice of the 
input-format.) 
        If no strategy is explicitly specified on the command-line, the policy 
chosen is "uniform size", where v2 behaves identically to old-DistCp. (The 
number of bytes transferred by each map-task is roughly equal, at a per-file 
granularity.) 
        Alternatively, v2 ships with a "dynamic" copy-strategy (in the 
DynamicInputFormat). This policy acknowledges that 
                (i)  dividing files based only on file-size might not be an 
even distribution (E.g. if some datanodes are slower than others, or if some 
files are skipped.)
                (ii) a "static" association of a source-path to a map increases 
the likelihood of long-tails during copy.
        The "dynamic" strategy divides the list-of-source-paths into a number 
(> nMaps) of smaller parts. When each map completes its current list of paths, 
it picks up a new list to process, if available. So if a map-task is stuck on a 
slow (and not necessarily large) file, other maps can pick up the slack. The 
thinner the file-list is sliced, the greater the parallelism (and the lower the 
chances of long-tails). Within reason, of course: the number of these 
short-lived list-files is capped at an overridable maximum.
        Internal benchmarks against source/target clusters with some slow(ish) 
datanodes have indicated significant performance gains when using the 
dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
        Please note that the DynamicInputFormat might prove useful outside of 
DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also 
note that the copy-strategies have no bearing on the CopyMapper.map() 
implementation.
        
2. Improved startup-time and programmatic use:
        When the old-DistCp runs with -update, and creates the 
list-of-source-paths, it attempts to filter out files that might be skipped (by 
comparing file-sizes, checksums, etc.) This significantly increases the startup 
time (or the time spent in serial processing till the MR job is launched), 
blocking the calling-thread. This becomes pronounced as nFiles increases. 
(Internal benchmarks have seen situations where more time is spent setting up 
the job than on the actual transfer.)
        DistCpV2 postpones as much work as possible to the MR job. The 
file-listing isn't filtered until the map-task runs (at which time, identical 
files are skipped). DistCpV2 can now be run "asynchronously". The program quits 
at job-launch, logging the job-id for tracking. Programmatically, the 
DistCp.execute() returns a Job instance for progress-tracking.
        
3. New features:
        (i)   -async: As described in #2.
        (ii)  -atomic: Data is copied to a (user-specifiable) tmp-location, and 
then moved atomically to destination.
        (iii) -bandwidth: Enforces a limit on the bandwidth consumed per map.
        (iv)  -strategy: As above.    
        
A more comprehensive description the newer features, how the dynamic-strategy 
works, etc. is available in src/site/xdoc/, and in the pdf that's generated 
therefrom, during the build.

I look forward to comments, suggestions and discussion that will hopefully 
ensue. I have this running against Hadoop 0.20.203.0. I also have a port to 
0.23.0 (complete with unit-tests).

P.S.
A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for 
ideas, code, reviews and guidance. Although much of the code is mine, the idea 
to use the DFS to implement "dynamic" input-splits wasn't.
        



> DistCp Rewrite
> --------------
>
>                 Key: MAPREDUCE-2765
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2765
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: distcp
>    Affects Versions: 0.20.203.0
>            Reporter: Mithun Radhakrishnan
>         Attachments: distcpv2.20.203.patch
>
>
> This is a slightly modified version of the DistCp rewrite that Yahoo uses in 
> production today. The rewrite was ground-up, with specific focus on:
> 1. improved startup time (postponing as much work as possible to the MR job)
> 2. support for multiple copy-strategies
> 3. new features (e.g. -atomic, -async, -bandwidth.)
> 4. improved programmatic use
> Some effort has gone into refactoring what used to be achieved by a single 
> large (1.7 KLOC) source file, into a design that (hopefully) reads better too.
> The proposed DistCpV2 preserves command-line-compatibility with the old 
> version, and should be a drop-in replacement.
> New to v2:
> 1. Copy-strategies and the DynamicInputFormat:
>       A copy-strategy determines the policy by which source-file-paths are 
> distributed between map-tasks. (These boil down to the choice of the 
> input-format.) 
>       If no strategy is explicitly specified on the command-line, the policy 
> chosen is "uniform size", where v2 behaves identically to old-DistCp. (The 
> number of bytes transferred by each map-task is roughly equal, at a per-file 
> granularity.) 
>       Alternatively, v2 ships with a "dynamic" copy-strategy (in the 
> DynamicInputFormat). This policy acknowledges that 
>               (a)  dividing files based only on file-size might not be an 
> even distribution (E.g. if some datanodes are slower than others, or if some 
> files are skipped.)
>               (b) a "static" association of a source-path to a map increases 
> the likelihood of long-tails during copy.
>       The "dynamic" strategy divides the list-of-source-paths into a number 
> (> nMaps) of smaller parts. When each map completes its current list of 
> paths, it picks up a new list to process, if available. So if a map-task is 
> stuck on a slow (and not necessarily large) file, other maps can pick up the 
> slack. The thinner the file-list is sliced, the greater the parallelism (and 
> the lower the chances of long-tails). Within reason, of course: the number of 
> these short-lived list-files is capped at an overridable maximum.
>       Internal benchmarks against source/target clusters with some slow(ish) 
> datanodes have indicated significant performance gains when using the 
> dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
>       Please note that the DynamicInputFormat might prove useful outside of 
> DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also 
> note that the copy-strategies have no bearing on the CopyMapper.map() 
> implementation.
>       
> 2. Improved startup-time and programmatic use:
>       When the old-DistCp runs with -update, and creates the 
> list-of-source-paths, it attempts to filter out files that might be skipped 
> (by comparing file-sizes, checksums, etc.) This significantly increases the 
> startup time (or the time spent in serial processing till the MR job is 
> launched), blocking the calling-thread. This becomes pronounced as nFiles 
> increases. (Internal benchmarks have seen situations where more time is spent 
> setting up the job than on the actual transfer.)
>       DistCpV2 postpones as much work as possible to the MR job. The 
> file-listing isn't filtered until the map-task runs (at which time, identical 
> files are skipped). DistCpV2 can now be run "asynchronously". The program 
> quits at job-launch, logging the job-id for tracking. Programmatically, the 
> DistCp.execute() returns a Job instance for progress-tracking.
>       
> 3. New features:
>       (a)   -async: As described in #2.
>       (b)   -atomic: Data is copied to a (user-specifiable) tmp-location, and 
> then moved atomically to destination.
>       (c)   -bandwidth: Enforces a limit on the bandwidth consumed per map.
>       (d)   -strategy: As above.    
>       
> A more comprehensive description the newer features, how the dynamic-strategy 
> works, etc. is available in src/site/xdoc/, and in the pdf that's generated 
> therefrom, during the build.
> High on the list of things to do is support to parallelize copies on a 
> per-block level. (i.e. Incorporation of HDFS-222.)
> I look forward to comments, suggestions and discussion that will hopefully 
> ensue. I have this running against Hadoop 0.20.203.0. I also have a port to 
> 0.23.0 (complete with unit-tests).
> P.S.
> A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for 
> ideas, code, reviews and guidance. Although much of the code is mine, the 
> idea to use the DFS to implement "dynamic" input-splits wasn't.
>       

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to