[ 
https://issues.apache.org/jira/browse/CRUNCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Olson updated CRUNCH-679:
--------------------------------
    Description: 
As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and 
improvements have been identified during testing.

* We need to preserve preferred part names, e.g. part-m-00000. Currently the 
DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile 
method, and would therefore create destination file names like out0-m-00000, 
which are problematic when there are multiple map-only jobs writing to the same 
target path. The preferred names can be preserved by providing a custom 
CopyListing implementation that is capable of dynamically renaming target 
paths based on a given mapping. Unfortunately, a substantial amount of code 
duplication from the original SimpleCopyListing class is currently required in 
order to inject the necessary logic for modifying the sequence file entry keys. 
HADOOP-16147 has been opened to allow this to be simplified in the future.

* The handleOutputs implementation in HFileTarget is essentially identical to 
the one in FileTargetImpl that it overrides. We can remove it and just share 
the same code.

* It could be useful to add a property for configuring the max DistCp task 
bandwidth, as the default (100 MB/s per task) may be too high for certain 
environments.

* The default of 1000 for max DistCp map tasks may be too high in some 
situations, resulting in 503 Slow Down errors from S3. Reducing it to 100 
should help prevent issues along those lines while still providing adequate 
parallel throughput.
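
The renaming step described in the first bullet can be illustrated with a minimal sketch in plain Java. It is not the Crunch or Hadoop code: the class PartNameRenamer and its rename method are hypothetical names, and the sketch assumes that each copy listing key is a target-relative path whose last component is the file name to be replaced. A real CopyListing implementation would apply the same mapping while writing the listing entries.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Crunch or Hadoop) that rewrites the last
 * component of a target-relative path according to a given mapping.
 */
public class PartNameRenamer {
    // Maps the file name a job wrote (e.g. out0-m-00000) to the preferred
    // name at the destination (e.g. part-m-00000).
    private final Map<String, String> renames;

    public PartNameRenamer(Map<String, String> renames) {
        this.renames = new HashMap<>(renames);
    }

    /** Returns the path with its file name replaced if a mapping exists, else unchanged. */
    public String rename(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        // When there is no slash, slash is -1 and the parent is the empty string.
        String parent = relativePath.substring(0, slash + 1);
        String name = relativePath.substring(slash + 1);
        return parent + renames.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("out0-m-00000", "part-m-00000");
        PartNameRenamer renamer = new PartNameRenamer(mapping);
        System.out.println(renamer.rename("/out0-m-00000"));
        System.out.println(renamer.rename("/_SUCCESS"));
    }
}
```

Paths without an entry in the mapping pass through untouched, so only the files that need a preferred name are affected.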

  was:
As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and 
improvements have been identified during testing.

* We need to preserve preferred part names, e.g. part-m-00000. Currently the 
DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile 
method, and would therefore create destination file names like out0-m-00000, 
which are problematic when there are multiple map-only jobs writing to the same 
target path. This can be achieved by providing a custom CopyListing 
implementation that is capable of dynamically renaming target paths based on a 
given mapping. Unfortunately a substantial amount of code duplication from the 
original SimpleCopyListing class is currently required in order to inject the 
necessary logic for modifying the sequence file entry keys. HADOOP-16147 has 
been opened to allow it to be simplified in the future.

* The handleOutputs implementation in HFileTarget is essentially identical to 
the one in FileTargetImpl that it overrides. We can remove it and just share 
the same code.

* It could be useful to add a property for configuring the max DistCp task 
bandwidth, as the default (100 MB/s per task) may be too high for certain 
environments.


> Improvements for usage of DistCp
> --------------------------------
>
>                 Key: CRUNCH-679
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-679
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Andrew Olson
>            Assignee: Josh Wills
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and 
> improvements have been identified during testing.
> * We need to preserve preferred part names, e.g. part-m-00000. Currently the 
> DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile 
> method, and would therefore create destination file names like out0-m-00000, 
> which are problematic when there are multiple map-only jobs writing to the 
> same target path. The preferred names can be preserved by providing a custom 
> CopyListing implementation that is capable of dynamically renaming target 
> paths based on a given mapping. Unfortunately, a substantial amount of code 
> duplication from the original SimpleCopyListing class is currently required 
> in order to inject the necessary logic for modifying the sequence file entry 
> keys. HADOOP-16147 has been opened to allow this to be simplified in the 
> future.
> * The handleOutputs implementation in HFileTarget is essentially identical to 
> the one in FileTargetImpl that it overrides. We can remove it and just share 
> the same code.
> * It could be useful to add a property for configuring the max DistCp task 
> bandwidth, as the default (100 MB/s per task) may be too high for certain 
> environments.
> * The default of 1000 for max DistCp map tasks may be too high in some 
> situations, resulting in 503 Slow Down errors from S3. Reducing it to 100 
> should help prevent issues along those lines while still providing adequate 
> parallel throughput.
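
The two tuning items above (a configurable per-task bandwidth cap and a lower default for the number of map tasks) might look roughly like the following sketch. The property names and the DistCpLimits class are assumptions made for illustration and are not the names used by Crunch. The defaults are the values mentioned in the description: 100 map tasks and 100 MB/s per task.

```java
import java.util.Properties;

/** Hypothetical holder for DistCp limits read from configuration properties. */
public class DistCpLimits {
    // Assumed property names, for illustration only.
    public static final String MAX_TASKS_KEY = "example.distcp.max.tasks";
    public static final String MAX_BANDWIDTH_MB_KEY = "example.distcp.max.task.bandwidth.mb";

    public static final int DEFAULT_MAX_TASKS = 100;        // proposed lower default
    public static final int DEFAULT_MAX_BANDWIDTH_MB = 100; // existing per-task default

    public final int maxTasks;
    public final int maxBandwidthMb;

    public DistCpLimits(Properties props) {
        this.maxTasks = readPositiveInt(props, MAX_TASKS_KEY, DEFAULT_MAX_TASKS);
        this.maxBandwidthMb = readPositiveInt(props, MAX_BANDWIDTH_MB_KEY, DEFAULT_MAX_BANDWIDTH_MB);
    }

    /** Reads a positive integer property, falling back to the default when it is unset. */
    static int readPositiveInt(Properties props, String key, int defaultValue) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return defaultValue;
        }
        int value = Integer.parseInt(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(key + " must be positive: " + value);
        }
        return value;
    }

    /** Upper bound on total copy throughput in MB/s if every task ran at its cap. */
    public long maxAggregateMbPerSecond() {
        return (long) maxTasks * maxBandwidthMb;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(MAX_BANDWIDTH_MB_KEY, "25");
        DistCpLimits limits = new DistCpLimits(props);
        System.out.println(limits.maxTasks + " tasks, " + limits.maxBandwidthMb
                + " MB/s each, at most " + limits.maxAggregateMbPerSecond() + " MB/s in total");
    }
}
```

Multiplying the two limits gives a quick upper bound on aggregate load, which is one way to reason about whether a given setting is likely to trigger throttling on the destination store.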



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
