[
https://issues.apache.org/jira/browse/HADOOP-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220558#comment-14220558
]
Corby Wilson commented on HADOOP-11281:
---------------------------------------
7312 only handles the distcp case, when you are actually calling 'hadoop jar
distcp'.
It doesn't address the case where the source or target file is located at s3://.
For example, a mapper reads in an s3:// file and the reducer saves the output to
hdfs://.
So it would be great if we could either:
1. Add a property to core-site that allows us to skip the temp file on
rename/copy (fs.s3.norename)
2. Add a FileSystem call so that CommandWithDestination calls down to the Fs
class to do the copy rather than doing it itself.
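As a rough illustration of option 1, here is a minimal sketch of a copy that skips the ._COPYING_ temp file when a flag is set. It uses only java.nio on the local filesystem and is not Hadoop code: the class CopySketch, the copy method, and reading fs.s3.norename from a system property are all assumptions made for illustration. fs.s3.norename is the property name proposed above, not an existing Hadoop key.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: local-file stand-in for the temp-then-rename copy.
public class CopySketch {
    // Proposed (not existing) property name from the comment above.
    static final String NO_RENAME_KEY = "fs.s3.norename";

    // Writes the stream to target. With skipTemp the data is written once,
    // directly to the final path; otherwise it goes to a ._COPYING_ sibling
    // first and is then renamed, mirroring the current shell behaviour.
    static Path copy(InputStream in, Path target, boolean skipTemp) throws IOException {
        if (skipTemp) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        }
        Path temp = target.resolveSibling(target.getFileName() + "._COPYING_");
        Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("copy-sketch");
        // Assumption: the flag is read from a JVM system property here,
        // standing in for a core-site.xml lookup.
        boolean skipTemp = Boolean.parseBoolean(System.getProperty(NO_RENAME_KEY, "false"));
        InputStream in = new ByteArrayInputStream("data".getBytes(StandardCharsets.UTF_8));
        Path out = copy(in, dir.resolve("part-00000"), skipTemp);
        System.out.println(Files.readAllLines(out));
    }
}
```

On a store without a native rename, the skipTemp branch is the one that avoids the second write; the trade-off is that a failed copy can leave a partial file at the final path.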
> Add flag to fs.shell to skip _COPYING_ file
> -------------------------------------------
>
> Key: HADOOP-11281
> URL: https://issues.apache.org/jira/browse/HADOOP-11281
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs, fs/s3
> Environment: Hadoop 2.2, but the issue is present in all versions.
> AWS EMR 3.0.4
> Reporter: Corby Wilson
> Priority: Critical
>
> Amazon S3 does not have a rename feature.
> When you use the hadoop shell or distcp feature, hadoop first uploads the
> file using the ._COPYING_ extension, then renames the file to the final
> output.
> Code:
> org/apache/hadoop/fs/shell/CommandWithDestination.java
> PathData tempTarget = target.suffix("._COPYING_");
> targetFs.setWriteChecksum(writeChecksum);
> targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
> targetFs.rename(tempTarget, target);
> The problem is that on rename, we actually have to download the file again
> (through an InputStream), then upload it again.
> For very large files (>= 5GB) we have to use multipart upload.
> So if we are processing several TB of multi-GB files, we are actually writing
> each file to S3 twice and reading it once from S3.
> It would be nice to have a flag or core-site.xml setting that allowed us to
> tell hadoop to skip the copy and just write the file once.