[ https://issues.apache.org/jira/browse/CRUNCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Whitacre resolved CRUNCH-679.
-----------------------------------
    Resolution: Fixed

> Improvements for usage of DistCp
> --------------------------------
>
>                 Key: CRUNCH-679
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-679
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Andrew Olson
>            Assignee: Josh Wills
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and 
> improvements have been identified during testing.
> * We need to preserve preferred part names, e.g. part-m-00000. Currently the
> DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile
> method, and would therefore create destination file names like out0-m-00000,
> which are problematic when there are multiple map-only jobs writing to the
> same target path. Preserving the names can be achieved by providing a custom
> CopyListing implementation that is capable of dynamically renaming target
> paths based on a given mapping (see the first sketch after this list).
> Unfortunately, a substantial amount of code duplication from the original
> SimpleCopyListing class is currently required in order to inject the
> necessary logic for modifying the sequence file entry keys. HADOOP-16147 has
> been opened to allow this to be simplified in the future.
> * The handleOutputs implementation in HFileTarget is essentially identical to 
> the one in FileTargetImpl that it overrides. We can remove it and just share 
> the same code.
> * It could be useful to add a property for configuring the maximum DistCp
> per-task bandwidth, as the default (100 MB/s per task) may be too high for
> certain environments.
> * The default of 1000 for the maximum number of DistCp map tasks may also be
> too high in some situations, resulting in 503 Slow Down errors from S3,
> especially if there are multiple jobs writing into the same bucket. Reducing
> it to 100 should help prevent issues along those lines while still providing
> adequate parallel throughput. Both knobs are illustrated in the second
> sketch after this list.
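
To make the renaming idea in the first bullet concrete, below is a minimal
sketch that rewrites the entry keys of a copy listing through a supplied
mapping. It assumes the listing is the usual DistCp sequence file of Text keys
(target-relative paths) and CopyListingFileStatus values. The ListingRenamer
class, rewriteListing method, and nameMapping parameter are illustrative
placeholders, not the actual Crunch change, which has to duplicate
SimpleCopyListing internals so the renaming happens while the listing is being
built (pending HADOOP-16147).

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.tools.CopyListingFileStatus;

/** Illustrative helper, not the Crunch implementation: rewrites listing keys via a mapping. */
public class ListingRenamer {

  public static void rewriteListing(Configuration conf, Path in, Path out,
      Map<String, String> nameMapping) throws IOException {
    Text key = new Text();
    CopyListingFileStatus value = new CopyListingFileStatus();
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(in));
         SequenceFile.Writer writer =
             SequenceFile.createWriter(conf,
                 SequenceFile.Writer.file(out),
                 SequenceFile.Writer.keyClass(Text.class),
                 SequenceFile.Writer.valueClass(CopyListingFileStatus.class))) {
      while (reader.next(key, value)) {
        // e.g. rename ".../out0-m-00000" to ".../part-m-00000" so preferred part names survive
        String renamed = nameMapping.getOrDefault(key.toString(), key.toString());
        writer.append(new Text(renamed), value);
      }
    }
  }
}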

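As a rough illustration of the two tuning knobs in the last bullets, the sketch
below applies a bandwidth cap and a map-task limit through the mutable Hadoop 2
DistCpOptions API (Hadoop 3 would go through DistCpOptions.Builder instead).
The property names crunch.distcp.max.bandwidth.mb and crunch.distcp.max.maps
are placeholders rather than the actual Crunch configuration keys; the defaults
mirror the values discussed above.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

/** Illustrative only: builds a throttled DistCp invocation from placeholder config keys. */
public class ThrottledDistCp {

  public static void copy(Configuration conf, Path src, Path dst) throws Exception {
    // Placeholder property names; the defaults mirror the values discussed in this issue.
    int bandwidthMb = conf.getInt("crunch.distcp.max.bandwidth.mb", 100); // MB/s per map task
    int maxMaps = conf.getInt("crunch.distcp.max.maps", 100);             // down from 1000

    DistCpOptions options = new DistCpOptions(Collections.singletonList(src), dst);
    options.setMapBandwidth(bandwidthMb); // caps the per-task copy bandwidth
    options.setMaxMaps(maxMaps);          // limits concurrent copy tasks, easing S3 503s

    new DistCp(conf, options).execute();  // runs the copy job and waits for it by default
  }
}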


--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
