Corby Wilson created HADOOP-11281:
-------------------------------------

             Summary: Add flag to fs.shell to skip _COPYING_ file
                 Key: HADOOP-11281
                 URL: https://issues.apache.org/jira/browse/HADOOP-11281
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs, fs/s3
         Environment: Hadoop 2.2, but present in all versions. AWS EMR 3.0.4
            Reporter: Corby Wilson
            Priority: Critical
Amazon S3 does not have a rename feature. When you use the hadoop shell or distcp, hadoop first uploads the file with a ._COPYING_ suffix, then renames it to the final name.

Code (org/apache/hadoop/fs/shell/CommandWithDestination.java):

    PathData tempTarget = target.suffix("._COPYING_");
    targetFs.setWriteChecksum(writeChecksum);
    targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
    targetFs.rename(tempTarget, target);

The problem is that on rename we actually have to download the file again (through an InputStream) and then upload it again. For very large files (>= 5 GB) we have to use multipart upload. So when processing several TB of multi-GB files, each file is actually written to S3 twice and read from S3 once.

It would be nice to have a flag or core-site.xml setting that told hadoop to skip the temporary copy and just write the file once.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
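A minimal local-filesystem sketch of the proposed behavior. This is an illustration only: the `skipTempCopy` flag name is hypothetical (no such option exists in Hadoop), and `java.nio.file` stands in for Hadoop's `FileSystem`/`PathData` API so the sketch is self-contained.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SkipCopyingSketch {
    static final String COPYING_SUFFIX = "._COPYING_";

    // Hypothetical flag: when true, write straight to the final path
    // (one upload) instead of writing to <target>._COPYING_ and renaming,
    // which on S3 costs a second write of the whole object.
    static void writeFile(Path target, byte[] data, boolean skipTempCopy) throws IOException {
        if (skipTempCopy) {
            Files.write(target, data);  // single write, no rename step
        } else {
            // Current behavior: write to a temp name, then rename.
            Path temp = target.resolveSibling(target.getFileName() + COPYING_SUFFIX);
            Files.write(temp, data);
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("skipcopy");
        Path out = dir.resolve("part-00000");
        writeFile(out, "hello".getBytes(), true);
        System.out.println(Files.exists(out));
        System.out.println(Files.exists(dir.resolve("part-00000" + COPYING_SUFFIX)));
    }
}
```

The trade-off, of course, is that the temp-then-rename dance is what hides partially written files from readers; a skip flag would push that responsibility onto the caller.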