Andrew Olson created CRUNCH-679:
-----------------------------------

             Summary: Improvements for usage of DistCp
                 Key: CRUNCH-679
                 URL: https://issues.apache.org/jira/browse/CRUNCH-679
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Andrew Olson
            Assignee: Josh Wills

As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and improvements have been identified during testing.

* We need to preserve preferred part names, e.g. part-m-00000. Currently the DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile method, and would therefore create destination file names like out0-m-00000, which are problematic when there are multiple map-only jobs writing to the same target path. This can be achieved by providing a custom CopyListing implementation that is capable of dynamically renaming target paths based on a given mapping. Unfortunately a substantial amount of code duplication from the original SimpleCopyListing class is currently required in order to inject the necessary logic for modifying the sequence file entry keys. HADOOP-16147 has been opened to allow it to be simplified in the future.
* The handleOutputs implementation in HFileTarget is essentially identical to the one in FileTargetImpl that it overrides. We can remove it and just share the same code.
* It could be useful to add a property for configuring the max DistCp task bandwidth, as the default (100 MB/s per task) may be too high for certain environments.
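
For illustration only, the renaming step described in the first item could look roughly like the sketch below. It is not the actual Crunch or Hadoop code: the class TargetRenameSketch and the method renameTarget are hypothetical names, and the sketch assumes the mapping is keyed by source file name (e.g. out0-m-00000) with the preferred part name as its value. A real implementation would apply this logic to each entry key while building the copy listing.

```java
import java.util.Map;

// Hypothetical helper, not part of Crunch or Hadoop. It shows only the
// renaming step that a custom CopyListing might apply to each listing key.
final class TargetRenameSketch {

  private TargetRenameSketch() {}

  // Returns the relative target path for the given relative source path.
  // The final path component is replaced when the mapping has an entry
  // for it; otherwise the path is returned unchanged.
  static String renameTarget(String relativePath, Map<String, String> renames) {
    int slash = relativePath.lastIndexOf('/');
    String parent = relativePath.substring(0, slash + 1); // empty if no slash
    String name = relativePath.substring(slash + 1);
    String renamed = renames.get(name);
    return parent + (renamed != null ? renamed : name);
  }
}
```

With a mapping of out0-m-00000 to part-m-00000, renameTarget("/out0-m-00000", mapping) yields "/part-m-00000", while a path with no mapping entry passes through as-is.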