Andrew Olson created CRUNCH-679:
-----------------------------------

             Summary: Improvements for usage of DistCp
                 Key: CRUNCH-679
                 URL: https://issues.apache.org/jira/browse/CRUNCH-679
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Andrew Olson
            Assignee: Josh Wills


As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and 
improvements have been identified during testing.

* We need to preserve preferred part names, e.g. part-m-00000. Currently the 
DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile 
method, and would therefore create destination file names like out0-m-00000, 
which are problematic when there are multiple map-only jobs writing to the same 
target path. The preferred names can be preserved by providing a custom 
CopyListing implementation that dynamically renames target paths based on a 
given mapping. Unfortunately, a substantial amount of code currently has to be 
duplicated from the original SimpleCopyListing class in order to inject the 
necessary logic for modifying the sequence file entry keys. HADOOP-16147 has 
been opened to allow the custom implementation to be simplified in the future.

* The handleOutputs implementation in HFileTarget is essentially identical to 
the one in FileTargetImpl that it overrides. We can remove it and just share 
the same code.

* It could be useful to add a property for configuring the max DistCp task 
bandwidth, as the default (100 MB/s per task) may be too high for certain 
environments.
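
To illustrate the renaming idea from the first item, here is a minimal sketch 
in plain Java. It only shows how a given mapping could be applied to the last 
component of a target path. It is not the Crunch or Hadoop code: the class 
name TargetPathRenamer and its rename method are hypothetical, and an actual 
CopyListing implementation would apply similar logic to the sequence file 
entry keys instead of plain strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Crunch or Hadoop.
public class TargetPathRenamer {
    // Assumed mapping: written file name -> preferred destination file name.
    private final Map<String, String> renames;

    public TargetPathRenamer(Map<String, String> renames) {
        this.renames = new HashMap<>(renames);
    }

    // Rewrites only the final path component; parent directories are kept.
    // Names without a mapping entry are returned unchanged.
    public String rename(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        String parent = relativePath.substring(0, slash + 1);
        String name = relativePath.substring(slash + 1);
        return parent + renames.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        Map<String, String> renames = new HashMap<>();
        renames.put("out0-m-00000", "part-m-00000");
        TargetPathRenamer renamer = new TargetPathRenamer(renames);
        System.out.println(renamer.rename("/out0-m-00000"));
        System.out.println(renamer.rename("/_SUCCESS"));
    }
}
```

Keeping the mapping keyed by file name means unrelated entries, such as 
marker files, pass through untouched.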



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
