Mahendra M created LIBCLOUD-269:
-----------------------------------
Summary: Multipart upload for amazon S3
Key: LIBCLOUD-269
URL: https://issues.apache.org/jira/browse/LIBCLOUD-269
Project: Libcloud
Issue Type: Improvement
Components: Storage
Reporter: Mahendra M
This patch adds support for streaming data uploads using Amazon's multipart
upload feature, as documented at
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.html
With the current behaviour, the upload_object_via_stream() API reads the
entire object into memory and then uploads it to S3. This becomes problematic
with large files (think HD videos of around 4 GB) and takes a significant toll
on the performance and memory usage of the Python application.
With this patch, upload_object_via_stream() uses the S3 multipart upload
feature to upload data in 5 MB chunks, reducing the overall memory footprint
of the application.
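For reference, here is a minimal usage sketch (the credentials, bucket name and
file path are placeholders); the calling code itself does not change, only the
driver internals do:

    from libcloud.storage.types import Provider
    from libcloud.storage.providers import get_driver

    cls = get_driver(Provider.S3)
    driver = cls('api key', 'api secret')          # placeholder credentials
    container = driver.get_container('my-bucket')  # placeholder bucket

    # The stream can be any file-like object or iterator; with this patch the
    # driver sends it to S3 in 5 MB parts instead of buffering it in memory.
    with open('/tmp/video.mp4', 'rb') as stream:
        driver.upload_object_via_stream(iterator=stream, container=container,
                                        object_name='video.mp4')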
Design of this feature:
* S3StorageDriver() is not used just for Amazon S3. It is subclassed for use
with other S3-compatible cloud storage providers like Google Storage.
* Amazon S3 multipart upload is not (or may not be) supported by those other
storage providers (which will prefer the chunked upload mechanism).
We can solve this problem in two ways:
1) Create a new subclass of S3StorageDriver (say AmazonS3StorageDriver) which
implements the new multipart upload mechanism, while other storage providers
keep subclassing S3StorageDriver. This is the cleaner approach.
2) Introduce an attribute supports_s3_multipart_upload and, based on its value,
control the callback function passed to the _put_object() API. This makes the
code look a bit hacky, but it is better for supporting such features in the
future, since we do not have to keep creating subclasses for each feature.
In the current patch I have implemented approach (2), though I prefer (1); a
sketch of the approach (2) dispatch follows below. After discussing with the
community and learning their preferences, we can select a final approach.
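A minimal, self-contained sketch of the approach (2) dispatch, with stand-in
method bodies; only the supports_s3_multipart_upload attribute and the idea of
selecting the callback passed to _put_object() come from the patch, the rest
is illustrative:

    class BaseS3Driver(object):
        # Providers without S3-style multipart upload set this to False.
        supports_s3_multipart_upload = True

        def upload_object_via_stream(self, iterator, container, object_name):
            if self.supports_s3_multipart_upload:
                upload_func = self._upload_multipart   # 5 MB part uploads
            else:
                upload_func = self._stream_data        # existing streaming path
            return self._put_object(container, object_name, iterator,
                                    upload_func=upload_func)

        def _put_object(self, container, object_name, iterator, upload_func):
            # The real driver builds the HTTP request here; this stub simply
            # delegates to whichever callback was selected above.
            return upload_func(container, object_name, iterator)

        def _upload_multipart(self, container, object_name, iterator):
            return 'multipart upload'

        def _stream_data(self, container, object_name, iterator):
            return 'plain streaming upload'

    class NonMultipartDriver(BaseS3Driver):
        # Stand-in for a provider such as Google Storage that lacks the
        # S3 multipart upload API.
        supports_s3_multipart_upload = False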
Design notes:
* The implementation has three steps, plus a failure path (a sketch of the
flow follows these notes):
1) A POST request to /container/object_name?uploads. This returns an XML
response containing a unique uploadId. This is handled as part of
_put_object(); doing it there ensures that all S3-related parameters are set
correctly.
2) Each chunk is uploaded via a PUT to
/container/object_name?partNumber=X&uploadId=*** - this is done via the
callback passed to _put_object(), named _upload_multipart().
3) An XML document listing the part numbers and the ETag headers returned for
each part is POSTed to /container/object_name?uploadId=***, implemented via
_commit_multipart().
4) On any failure in step (2) or (3), the upload is deleted from S3 with a
DELETE request to /container/object_name?uploadId=****, implemented via
_abort_multipart().
* The chunk size for uploads was set to 5 MB - the minimum part size allowed
per the Amazon S3 docs.
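A minimal, self-contained sketch of steps (2)-(4), assuming step (1) has
already produced an uploadId; upload_part(), commit() and abort() are
stand-ins for the real HTTP calls (_upload_multipart(), _commit_multipart()
and _abort_multipart()), and only the 5 MB chunking and the shape of the
completion XML are meant to mirror the patch:

    CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, the S3 minimum part size

    def read_in_chunks(stream, chunk_size=CHUNK_SIZE):
        # Yield successive chunk_size blocks from a file-like object.
        while True:
            data = stream.read(chunk_size)
            if not data:
                break
            yield data

    def completion_xml(etags):
        # Body for step (3): one <Part> entry per (part_number, etag) pair.
        parts = ''.join('<Part><PartNumber>%d</PartNumber><ETag>%s</ETag></Part>'
                        % (num, etag) for num, etag in etags)
        return '<CompleteMultipartUpload>%s</CompleteMultipartUpload>' % parts

    def multipart_upload(stream, upload_part, commit, abort):
        etags = []
        try:
            for part_number, chunk in enumerate(read_in_chunks(stream), 1):
                etag = upload_part(part_number, chunk)      # step (2)
                etags.append((part_number, etag))
            commit(completion_xml(etags))                   # step (3)
        except Exception:
            abort()                                         # step (4)
            raise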
Other changes:
* Did some PEP8 cleanup on s3.py
* s3.get_container() used to iterate through the full list of containers to
find the requested entry. This can be simplified by making a HEAD request (see
the HEAD-request sketch after this list). The only downside is that
'created_time' is not available for the container. Let me know if this
approach is OK or if I should revert it.
* Introduced the following APIs on S3StorageDriver() to make some
functionality easier:
get_container_cdn_url()
get_object_cdn_url()
* In libcloud.common.base.Connection, the request() method is the basis for
all HTTP requests made by libcloud. This method had a limitation which became
apparent in the S3 multipart upload implementation. To initialize an upload,
the API invoked is
/container/object_name?uploads
The 'uploads' parameter has to be passed as-is, without any value. If we used
the 'params' argument of request(), it would come out as 'uploads=***'. To
prevent this, the 'action' is set to /container/object_name?uploads and slight
modifications were made to how parameters are appended (see the urlencode
illustration after this list).
This also forced a change in BaseMockHttpObject._get_method_name().
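To illustrate the request() limitation above: urlencode() always emits
key=value pairs, so a bare marker like 'uploads' has to live in the action
itself, with any remaining parameters appended using '&' rather than '?' (the
extra parameter below is a placeholder):

    try:
        from urllib.parse import urlencode   # Python 3
    except ImportError:
        from urllib import urlencode         # Python 2

    print(urlencode({'uploads': ''}))
    # -> 'uploads=' rather than the bare 'uploads' marker S3 expects

    action = '/container/object_name?uploads'
    extra = {'example': 'value'}             # placeholder additional parameters
    separator = '&' if '?' in action else '?'
    print(action + separator + urlencode(extra))
    # -> /container/object_name?uploads&example=value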
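And for the get_container() change above, a sketch of what the HEAD-based
driver method might look like, assuming libcloud's Container and
ContainerDoesNotExistError types; the exact request path and error handling in
the patch may differ:

    from libcloud.storage.base import Container
    from libcloud.storage.types import ContainerDoesNotExistError

    def get_container(self, container_name):
        # A single HEAD request instead of listing every container.
        response = self.connection.request('/%s' % container_name,
                                           method='HEAD')
        if response.status == 404:
            raise ContainerDoesNotExistError(value=None, driver=self,
                                             container_name=container_name)
        # HEAD does not return the creation time, so 'extra' stays empty.
        return Container(name=container_name, extra={}, driver=self)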
Bug fixes in the test framework:
* While working on the test cases, I noticed a small issue; I am not sure
whether it is a bug or by design.
MockRawResponse._get_response_if_not_availale() would return two different
values on subsequent invocations:
    if not self._response:
        ...
        return self            # <-- this return value was inconsistent
    return self._response
While adding test cases for the Amazon S3 functionality, I noticed that
instead of getting back a MockResponse I was getting a MockRawResponse
instance (which did not have methods like read() or parse_body()), so I fixed
this issue. Because of this, other test cases started failing and were
subsequently fixed. I am not sure whether this needs fixing or whether it was
done on purpose; if someone can shed some light on it, I can work on it
further. As of now, all test cases pass.
* In test_s3.py, the driver was being set everywhere to S3StorageDriver. The
same test cases are reused for GoogleStorageDriver, where the driver then
turned up as S3StorageDriver instead of GoogleStorageDriver. This was fixed by
changing the code to use driver=self.driver_type (see the sketch below).
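A sketch of the resulting pattern, with hypothetical fixture details
(credentials, class names); only the use of a driver_type class attribute
comes from the change described above:

    import unittest

    from libcloud.storage.drivers.s3 import S3StorageDriver
    from libcloud.storage.drivers.google_storage import GoogleStorageDriver

    class S3Tests(unittest.TestCase):
        driver_type = S3StorageDriver

        def setUp(self):
            # Instantiating via self.driver_type lets the subclass below
            # exercise its own driver instead of always getting S3StorageDriver.
            self.driver = self.driver_type('api key', 'api secret')

    class GoogleStorageTests(S3Tests):
        driver_type = GoogleStorageDriver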
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira