[ https://issues.apache.org/jira/browse/HADOOP-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jordan Mendelson updated HADOOP-9454:
-------------------------------------

    Status: Patch Available  (was: Open)

Here is a patch against trunk that adds multipart upload support. It also updates 
the jets3t library to 0.9.0 (based on a patch in HADOOP-8136).

It is difficult to build automated tests for this because testing requires a valid 
S3 access key to write to S3 buckets. However, I have verified that the patch does 
allow uploads of files larger than 5 GB: I uploaded an 8 GB image of my root 
filesystem, renamed it on S3 (which requires a multipart upload copy), downloaded it 
again and compared the md5sums. I also verified that it continues to behave as 
before when fs.s3n.multipart.uploads.enabled is set to false, and I have run through 
various fs commands to check that everything works as it should.
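
For reference, the manual verification was roughly the following sequence (the 
bucket and file names are placeholders, and S3 credentials are assumed to be 
configured already):

    hadoop fs -put rootfs.img s3n://BUCKET/rootfs.img              # 8 GB source => multipart upload
    hadoop fs -mv s3n://BUCKET/rootfs.img s3n://BUCKET/moved.img   # rename => multipart copy
    hadoop fs -get s3n://BUCKET/moved.img rootfs.check.img
    md5sum rootfs.img rootfs.check.img                             # sums must match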

This patch adds two config options: fs.s3n.multipart.uploads.enabled and 
fs.s3n.multipart.uploads.block.size. The former is named after the Amazon setting 
that does the same thing and defaults to false. The latter controls both the minimum 
file size at which multipart uploads are used and the size of each uploaded part 
(default: 64 MB).
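
As a minimal sketch of enabling the feature programmatically (the two property names 
are the ones introduced by this patch; everything else is the stock Hadoop 
Configuration/FileSystem API, and the same keys can of course go into core-site.xml 
instead):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    Configuration conf = new Configuration();
    // New keys added by this patch: turn multipart uploads on and use 64 MB parts.
    conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
    conf.setLong("fs.s3n.multipart.uploads.block.size", 64 * 1024 * 1024);
    // Credentials are supplied the usual way (fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey).
    FileSystem s3n = FileSystem.get(URI.create("s3n://BUCKET/"), conf);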

By default, jets3t will only spawn two threads to upload, but you can change this by 
setting the threaded-service.max-thread-count property in the jets3t.properties 
file. I've tried upwards of 20 threads and it is significantly faster.
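
For example (this is the standard jets3t.properties mechanism, not something added 
by the patch; the file just needs to be on the classpath):

    # jets3t.properties
    threaded-service.max-thread-count=20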

This patch should also work with older versions of Hadoop with only minor changes, 
since the s3native and s3 filesystems haven't changed much; I originally wrote it 
for CDH 4.

Please note that, because of the way hadoop fs works, a remote copy is required, 
which takes a while for large files. hadoop fs first uploads a file as 
filename._COPYING_ and then renames it, but there is no rename support on Amazon S3, 
so we must do a copy() followed by a delete(), and the copy() can take quite a while 
for large files. Also because of this, when multipart uploads are enabled, an 
additional request is made to AWS during a copy to check whether the source file is 
larger than 5 GB.
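
To make that concrete, here is a rough sketch of what a rename bottoms out to in 
s3native (not the literal patch code; renameObject is a hypothetical helper and 
store stands for the NativeFileSystemStore the filesystem already uses):

    // S3 has no rename primitive, so a rename is a server-side copy followed by a delete.
    // With multipart uploads enabled, the patch additionally checks the source object's
    // size first, so that sources over 5 GB go through a multipart copy rather than a
    // single COPY request.
    private void renameObject(String srcKey, String dstKey) throws IOException {
      store.copy(srcKey, dstKey);   // server-side copy; slow for very large objects
      store.delete(srcKey);         // remove the original key
    }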
                
> Support multipart uploads for s3native
> --------------------------------------
>
>                 Key: HADOOP-9454
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9454
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Jordan Mendelson
>
> The s3native filesystem is limited to 5 GB file uploads to S3; however, the newest 
> version of jets3t supports multipart uploads, which allow storing multi-TB files. 
> While the s3 filesystem lets you bypass this restriction by uploading blocks, we 
> need to output our data into Amazon's publicdatasets bucket, which is shared with 
> others.
> Amazon has added a similar feature to their distribution of Hadoop, as has MapR.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
