Corby Wilson created HADOOP-11281:
-------------------------------------

             Summary: Add flag to fs.shell to skip _COPYING_ file
                 Key: HADOOP-11281
                 URL: https://issues.apache.org/jira/browse/HADOOP-11281
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs, fs/s3
         Environment: Hadoop 2.2, but the behavior is present in all versions.
AWS EMR 3.0.4
            Reporter: Corby Wilson
            Priority: Critical


Amazon S3 does not have a native rename operation.
When you use the Hadoop shell or distcp, Hadoop first uploads the file with a 
._COPYING_ suffix, then renames it to the final destination.

Code:
org/apache/hadoop/fs/shell/CommandWithDestination.java
      PathData tempTarget = target.suffix("._COPYING_");
      targetFs.setWriteChecksum(writeChecksum);
      targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
      targetFs.rename(tempTarget, target);

The problem is that the rename forces S3 to copy the object: the file is 
effectively read back (through an InputStream) and uploaded again.
For very large files (>= 5GB) the copy must use multipart upload.
So when processing several TB of multi-GB files, we end up writing each file 
to S3 twice and reading it back once.

It would be nice to have a command-line flag or core-site.xml setting that 
tells Hadoop to skip the temporary ._COPYING_ file and write the destination 
file directly, so each file is written only once.
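A minimal sketch of how such a flag might gate the temp-file step. The property name fs.shell.skip.tmpfile and the simplified lookup below are assumptions for illustration only; they are not part of Hadoop's actual Configuration keys or the CommandWithDestination API.

```java
// Hypothetical sketch: choose the write target based on a skip-temp-file flag.
// The property name "fs.shell.skip.tmpfile" is an assumption, not a real key.
import java.util.HashMap;
import java.util.Map;

public class CopyTargetSketch {
    // Stand-in for Hadoop's Configuration lookup.
    static final Map<String, String> conf = new HashMap<>();

    // Returns the path the upload stream should be written to: the final
    // target when the skip flag is set, otherwise a ._COPYING_ temp path
    // that would later be renamed (an expensive copy+delete on S3).
    static String writeTarget(String target) {
        boolean skipTmp = Boolean.parseBoolean(
            conf.getOrDefault("fs.shell.skip.tmpfile", "false"));
        return skipTmp ? target : target + "._COPYING_";
    }

    public static void main(String[] args) {
        // Default behavior: write to the temp path, rename afterwards.
        System.out.println(writeTarget("s3://bucket/out.dat"));
        // With the flag set: write the destination object directly, once.
        conf.put("fs.shell.skip.tmpfile", "true");
        System.out.println(writeTarget("s3://bucket/out.dat"));
    }
}
```

With the flag set, the rename call in CommandWithDestination.java would simply be skipped, trading the atomic-visibility guarantee of the temp-file protocol for a single write.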



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)