[
https://issues.apache.org/jira/browse/HADOOP-16147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Olson updated HADOOP-16147:
----------------------------------
Fix Version/s: 3.3.0
> Allow CopyListing sequence file keys and values to be more easily customized
> ----------------------------------------------------------------------------
>
> Key: HADOOP-16147
> URL: https://issues.apache.org/jira/browse/HADOOP-16147
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Andrew Olson
> Assignee: Andrew Olson
> Priority: Major
> Fix For: 3.3.0, 3.2.1
>
> Attachments: HADOOP-16147-001.patch, HADOOP-16147-002.patch
>
>
> When using the Crunch library to run a distributed copy (CRUNCH-660,
> CRUNCH-675) at the conclusion of a job, we need to dynamically rename target
> paths to the preferred destination output part file names, rather than
> retaining the original source path names.
> A custom CopyListing implementation appears to be the proper solution for
> this. However, the SimpleCopyListing logic that needs to be adjusted is in a
> private method (writeToFileListing), so a relatively large portion of the
> class would need to be cloned.
> To minimize the amount of code duplication required for such a custom
> implementation, we propose adding two new protected methods to the
> CopyListing class that can be used to change the actual keys and/or values
> written to the copy listing sequence file:
> {noformat}
> protected Text getFileListingKey(Path sourcePathRoot,
>     CopyListingFileStatus fileStatus);
> protected CopyListingFileStatus getFileListingValue(
>     CopyListingFileStatus fileStatus);
> {noformat}
> The SimpleCopyListing class would then be modified to consume these methods
> as follows:
> {noformat}
> fileListWriter.append(
>     getFileListingKey(sourcePathRoot, fileStatus),
>     getFileListingValue(fileStatus));
> {noformat}
> The default implementations would simply preserve the present behavior of the
> SimpleCopyListing class, and could reside in either CopyListing or
> SimpleCopyListing, whichever is preferable.
> {noformat}
> protected Text getFileListingKey(Path sourcePathRoot,
>     CopyListingFileStatus fileStatus) {
>   return new Text(DistCpUtils.getRelativePath(sourcePathRoot,
>       fileStatus.getPath()));
> }
> protected CopyListingFileStatus getFileListingValue(
>     CopyListingFileStatus fileStatus) {
>   return fileStatus;
> }
> {noformat}
> Please let me know if this proposal seems to be on the right track. If so, I
> can provide a patch.
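
For illustration only, the extension points described above can be modeled
with a small self-contained Java sketch. It does not use the Hadoop classes:
plain Strings stand in for Path, Text, and CopyListingFileStatus, an
in-memory list stands in for the copy listing sequence file, and the class
names SketchListing, RenamingListing, and CopyListingSketch, as well as the
part-file naming scheme, are hypothetical. It only shows how a subclass
could change the listing key by overriding one protected method instead of
cloning the write logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Demo driver (hypothetical name); run with: java CopyListingSketch.java */
class CopyListingSketch {
    /** Feeds each source path through the listing and returns the recorded entries. */
    static List<String> list(SketchListing listing, String root, String... paths) {
        for (String path : paths) {
            listing.writeToFileListing(root, path);
        }
        return listing.entries;
    }

    public static void main(String[] args) {
        // Default hooks: keys are source paths relative to the source root.
        System.out.println(list(new SketchListing(), "/src", "/src/a/x.avro", "/src/b/y.avro"));
        // Overridden key hook: keys are generated part-file names.
        System.out.println(list(new RenamingListing(), "/src", "/src/a/x.avro", "/src/b/y.avro"));
    }
}

/** Stand-in for SimpleCopyListing. NOT the Hadoop class; Strings replace Hadoop types. */
class SketchListing {
    /** Stand-in for the sequence file: one "key -> value" string per appended record. */
    final List<String> entries = new ArrayList<>();

    /** Key hook. Default: the source path relative to the source root. */
    protected String getFileListingKey(String sourcePathRoot, String sourcePath) {
        return sourcePath.substring(sourcePathRoot.length());
    }

    /** Value hook. Default: the file status unchanged (here, just the source path). */
    protected String getFileListingValue(String sourcePath) {
        return sourcePath;
    }

    /** Mirrors the proposed call site: both the key and the value come from the hooks. */
    void writeToFileListing(String sourcePathRoot, String sourcePath) {
        entries.add(getFileListingKey(sourcePathRoot, sourcePath)
            + " -> " + getFileListingValue(sourcePath));
    }
}

/** Hypothetical subclass: replaces the relative-path key with a numbered part-file name. */
class RenamingListing extends SketchListing {
    private int nextPart = 0;

    @Override
    protected String getFileListingKey(String sourcePathRoot, String sourcePath) {
        // Assumed naming scheme for illustration only.
        return String.format(Locale.ROOT, "/part-%05d", nextPart++);
    }
}
```

In this model only getFileListingKey is overridden; the value hook and the
write loop are inherited unchanged, which is the reduction in duplicated
code the proposal is aiming for.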