[ https://issues.apache.org/jira/browse/CRUNCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778487#comment-16778487 ]
Andrew Olson commented on CRUNCH-679:
-------------------------------------

Pull request, https://github.com/apache/crunch/pull/20

> Improvements for usage of DistCp
> --------------------------------
>
>                 Key: CRUNCH-679
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-679
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Andrew Olson
>            Assignee: Josh Wills
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and
> improvements have been identified during testing.
> * We need to preserve preferred part names, e.g. part-m-00000. Currently the
> DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile
> method, and would therefore create destination file names like out0-m-00000,
> which are problematic when there are multiple map-only jobs writing to the
> same target path. This can be achieved by providing a custom CopyListing
> implementation that is capable of dynamically renaming target paths based on
> a given mapping. Unfortunately a substantial amount of code duplication from
> the original SimpleCopyListing class is currently required in order to inject
> the necessary logic for modifying the sequence file entry keys. HADOOP-16147
> has been opened to allow it to be simplified in the future.
> * The handleOutputs implementation in HFileTarget is essentially identical to
> the one in FileTargetImpl that it overrides. We can remove it and just share
> the same code.
> * It could be useful to add a property for configuring the max DistCp task
> bandwidth, as the default (100 MB/s per task) may be too high for certain
> environments.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
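For illustration only, the target renaming described in the first item of the issue description could be sketched roughly as follows. This is plain Java with no Hadoop dependency, and it is not the CopyListing code from the pull request: the class name TargetRenameSketch, the method renameTarget, and the /data/target path are all hypothetical. It only shows the idea of looking up a file's name in a given mapping and substituting the preferred part name while leaving the directory untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the Crunch or DistCp implementation.
public class TargetRenameSketch {

    // Returns targetPath with its final name component replaced by the
    // mapped name, or unchanged if the mapping has no entry for it.
    static String renameTarget(String targetPath, Map<String, String> renames) {
        int slash = targetPath.lastIndexOf('/');
        String dir = targetPath.substring(0, slash + 1);   // "" when there is no '/'
        String name = targetPath.substring(slash + 1);
        return dir + renames.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        // Assumed mapping from a job-specific output name to the preferred part name.
        Map<String, String> renames = new LinkedHashMap<>();
        renames.put("out0-m-00000", "part-m-00000");

        System.out.println(renameTarget("/data/target/out0-m-00000", renames));
        System.out.println(renameTarget("/data/target/_SUCCESS", renames));
    }
}
```

In the approach the issue describes, the equivalent substitution would be applied to the sequence file entry keys while the copy listing is built, which is why logic from SimpleCopyListing currently has to be duplicated.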