Andrew Olson created HADOOP-16147:
-------------------------------------

             Summary: Allow CopyListing sequence file keys and values to be more easily customized
                 Key: HADOOP-16147
                 URL: https://issues.apache.org/jira/browse/HADOOP-16147
             Project: Hadoop Common
          Issue Type: Improvement
          Components: tools/distcp
            Reporter: Andrew Olson


We have encountered a scenario in which, when using the Crunch library to run a 
distributed copy (CRUNCH-660, CRUNCH-675) at the conclusion of a job, we need to 
dynamically rename target paths to the preferred destination output part file 
names rather than retain the original source path names.

A custom CopyListing implementation appears to be the proper solution for this. 
However, the SimpleCopyListing logic that needs to be adjusted lives in a 
private method (writeToFileListing), so a relatively large portion of the class 
would need to be cloned.

To minimize the amount of code duplication required for such a custom 
implementation, we propose adding two new protected methods to the CopyListing 
class that can be used to change the actual keys and/or values written to the 
copy listing sequence file:

{noformat}
protected Text getFileListingKey(Path sourcePathRoot, CopyListingFileStatus fileStatus);

protected CopyListingFileStatus getFileListingValue(CopyListingFileStatus fileStatus);
{noformat}

The SimpleCopyListing class would then be modified to call these methods as 
follows:
{noformat}
fileListWriter.append(
   getFileListingKey(sourcePathRoot, fileStatus),
   getFileListingValue(fileStatus));
{noformat}

The default implementations would simply preserve the present behavior of the 
SimpleCopyListing class, and could reside in either CopyListing or 
SimpleCopyListing, whichever is preferable.

{noformat}
protected Text getFileListingKey(Path sourcePathRoot, CopyListingFileStatus fileStatus) {
   return new Text(DistCpUtils.getRelativePath(sourcePathRoot, fileStatus.getPath()));
}

protected CopyListingFileStatus getFileListingValue(CopyListingFileStatus fileStatus) {
   return fileStatus;
}
{noformat}
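
To illustrate how a custom implementation might use these hooks, here is a minimal, self-contained sketch. It does not use the real Hadoop classes: FileStatusStub, SimpleListingStub and RenamingListingStub are hypothetical stand-ins for CopyListingFileStatus, SimpleCopyListing and a custom subclass, keys are plain strings rather than Text, and the part file naming scheme is only an assumption chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: every type here is a hypothetical stand-in, not a Hadoop class.
class ListingSketch {

    /** Stand-in for CopyListingFileStatus; carries only a path string. */
    static final class FileStatusStub {
        final String path;

        FileStatusStub(String path) {
            this.path = path;
        }
    }

    /** Stand-in for SimpleCopyListing with the two proposed hook methods. */
    static class SimpleListingStub {
        // Stand-in for the sequence file writer: records appended key/value pairs in order.
        final Map<String, FileStatusStub> entries = new LinkedHashMap<>();

        // Default key: the path relative to the source root
        // (a simplified stand-in for DistCpUtils.getRelativePath).
        protected String getFileListingKey(String sourcePathRoot, FileStatusStub fileStatus) {
            return fileStatus.path.substring(sourcePathRoot.length());
        }

        // Default value: the file status, unchanged.
        protected FileStatusStub getFileListingValue(FileStatusStub fileStatus) {
            return fileStatus;
        }

        // The listing logic itself stays untouched; it only calls the hooks.
        void writeToFileListing(String sourcePathRoot, FileStatusStub fileStatus) {
            entries.put(
                getFileListingKey(sourcePathRoot, fileStatus),
                getFileListingValue(fileStatus));
        }
    }

    /** Hypothetical custom listing that renames targets to part file names. */
    static class RenamingListingStub extends SimpleListingStub {
        private int nextPart = 0;

        @Override
        protected String getFileListingKey(String sourcePathRoot, FileStatusStub fileStatus) {
            // Assumed naming scheme, for illustration only.
            return String.format("/part-m-%05d", nextPart++);
        }
    }

    public static void main(String[] args) {
        SimpleListingStub listing = new RenamingListingStub();
        listing.writeToFileListing("/src", new FileStatusStub("/src/a/one.avro"));
        listing.writeToFileListing("/src", new FileStatusStub("/src/b/two.avro"));
        // Keys are the renamed targets; values still point at the source files.
        listing.entries.forEach((key, value) -> System.out.println(key + " <- " + value.path));
    }
}
```

With hooks like these, the subclass overrides only the key computation, and nothing from the listing traversal needs to be cloned.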

Please let me know if this proposal seems to be on the right track. If so, I can 
provide a patch.


