[ https://issues.apache.org/jira/browse/CRUNCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778424#comment-16778424 ]
Andrew Olson commented on CRUNCH-679:
-------------------------------------

I will open a pull request for these changes later today.

> Improvements for usage of DistCp
> --------------------------------
>
>                 Key: CRUNCH-679
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-679
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Andrew Olson
>            Assignee: Josh Wills
>            Priority: Major
>
> As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and
> improvements have been identified during testing.
> * We need to preserve preferred part names, e.g. part-m-00000. Currently the
> DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile
> method, and would therefore create destination file names like out0-m-00000,
> which are problematic when there are multiple map-only jobs writing to the
> same target path. This can be achieved by providing a custom CopyListing
> implementation that is capable of dynamically renaming target paths based on
> a given mapping. Unfortunately a substantial amount of code duplication from
> the original SimpleCopyListing class is currently required in order to inject
> the necessary logic for modifying the sequence file entry keys. HADOOP-16147
> has been opened to allow it to be simplified in the future.
> * The handleOutputs implementation in HFileTarget is essentially identical to
> the one in FileTargetImpl that it overrides. We can remove it and just share
> the same code.
> * It could be useful to add a property for configuring the max DistCp task
> bandwidth, as the default (100 MB/s per task) may be too high for certain
> environments.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
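For illustration only, here is a minimal sketch of the renaming idea from the first bullet of the description: rewriting the final component of a copy-listing entry key according to a supplied mapping, so that out0-m-00000 becomes part-m-00000. It is plain Java with no Hadoop or Crunch dependency. The class name TargetPathRenamer and its rename method are hypothetical and are not taken from the Crunch patch or from SimpleCopyListing. It also assumes that each listing key is a slash-separated path relative to the target directory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Crunch or Hadoop): maps the file name at
 * the end of a relative listing path to a preferred destination name.
 */
final class TargetPathRenamer {
    // Source file name -> preferred destination file name.
    private final Map<String, String> renames;

    TargetPathRenamer(Map<String, String> renames) {
        this.renames = new LinkedHashMap<>(renames);
    }

    /**
     * Returns the path with its last component replaced when the mapping
     * contains it; any other path is returned unchanged.
     */
    String rename(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        String parent = relativePath.substring(0, slash + 1); // "" if no slash
        String name = relativePath.substring(slash + 1);
        return parent + renames.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("out0-m-00000", "part-m-00000");
        TargetPathRenamer renamer = new TargetPathRenamer(mapping);

        System.out.println(renamer.rename("/out0-m-00000")); // renamed entry
        System.out.println(renamer.rename("/_SUCCESS"));     // no mapping, unchanged
    }
}
```

In an actual CopyListing implementation, logic of this kind would presumably be applied to each key just before it is written to the listing sequence file, which is the step the description says currently requires duplicating code from SimpleCopyListing.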