Mahendra M created LIBCLOUD-269:
-----------------------------------
Summary: Multipart upload for amazon S3
Key: LIBCLOUD-269
URL: https://issues.apache.org/jira/browse/LIBCLOUD-269
Project: Libcloud
Issue Type: Improvement
Components: Storage
Reporter: Mahendra M
This patch adds support for streaming data uploads using Amazon's multipart
upload feature, as documented at
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.html
With the current behaviour, the upload_object_via_stream() API reads the
entire object into memory and then uploads it to S3. This becomes problematic
with large files (think HD videos of around 4 GB) and takes a significant toll
on the performance and memory usage of the Python application.
With this patch, upload_object_via_stream() uses the S3 multipart upload
feature to upload data in 5 MB chunks, reducing the overall memory footprint
of the application.
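For reference, here is a minimal usage sketch (the credentials, bucket name and
file path are placeholders); the calling code itself does not change, only the
driver internals do:

    from libcloud.storage.types import Provider
    from libcloud.storage.providers import get_driver

    cls = get_driver(Provider.S3)
    driver = cls('api key', 'api secret')          # placeholder credentials
    container = driver.get_container('my-bucket')  # placeholder bucket

    # The stream can be any file-like object or iterator; with this patch the
    # driver sends it to S3 in 5 MB parts instead of buffering it in memory.
    with open('/tmp/video.mp4', 'rb') as stream:
        driver.upload_object_via_stream(iterator=stream, container=container,
                                        object_name='video.mp4')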
Design of this feature:
* S3StorageDriver() is not used just for Amazon S3. It is subclassed for use
with other S3-compatible cloud storage providers like Google Storage.
* Amazon S3 multipart upload is not (or may not be) supported by those other
storage providers (which will prefer the chunked upload mechanism).
We can solve this problem in two ways:
1) Create a new subclass of S3StorageDriver (say AmazonS3StorageDriver) which
implements the new multipart upload mechanism, while other storage providers
keep subclassing S3StorageDriver. This is the cleaner approach.
2) Introduce an attribute supports_s3_multipart_upload and, based on its value,
control the callback function passed to the _put_object() API. This makes the
code look a bit hacky, but it is better for supporting such features in the
future, since we do not have to keep creating subclasses for each feature.
In the current patch I have implemented approach (2), though I prefer (1); a
sketch of the approach (2) dispatch follows below. After discussing with the
community and learning their preferences, we can select a final approach.
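A minimal, self-contained sketch of the approach (2) dispatch, with stand-in
method bodies; only the supports_s3_multipart_upload attribute and the idea of
selecting the callback passed to _put_object() come from the patch, the rest
is illustrative:

    class BaseS3Driver(object):
        # Providers without S3-style multipart upload set this to False.
        supports_s3_multipart_upload = True

        def upload_object_via_stream(self, iterator, container, object_name):
            if self.supports_s3_multipart_upload:
                upload_func = self._upload_multipart   # 5 MB part uploads
            else:
                upload_func = self._stream_data        # existing streaming path
            return self._put_object(container, object_name, iterator,
                                    upload_func=upload_func)

        def _put_object(self, container, object_name, iterator, upload_func):
            # The real driver builds the HTTP request here; this stub simply
            # delegates to whichever callback was selected above.
            return upload_func(container, object_name, iterator)

        def _upload_multipart(self, container, object_name, iterator):
            return 'multipart upload'

        def _stream_data(self, container, object_name, iterator):
            return 'plain streaming upload'

    class NonMultipartDriver(BaseS3Driver):
        # Stand-in for a provider such as Google Storage that lacks the
        # S3 multipart upload API.
        supports_s3_multipart_upload = False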
Design notes:
* The implementation has three steps, plus a failure path (a sketch of the
flow follows these notes):
1) A POST request to /container/object_name?uploads. This returns an XML
response containing a unique uploadId. This is handled as part of
_put_object(); doing it there ensures that all S3-related parameters are set
correctly.
2) Each chunk is uploaded via a PUT to
/container/object_name?partNumber=X&uploadId=*** - this is done via the
callback passed to _put_object(), named _upload_multipart().
3) An XML document listing the part numbers and the ETag headers returned for
each part is POSTed to /container/object_name?uploadId=***, implemented via
_commit_multipart().
4) On any failure in step (2) or (3), the upload is deleted from S3 with a
DELETE request to /container/object_name?uploadId=****, implemented via
_abort_multipart().
* The chunk size for uploads was set to 5 MB - the minimum part size allowed
per the Amazon S3 docs.
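A minimal, self-contained sketch of steps (2)-(4), assuming step (1) has
already produced an uploadId; upload_part(), commit() and abort() are
stand-ins for the real HTTP calls (_upload_multipart(), _commit_multipart()
and _abort_multipart()), and only the 5 MB chunking and the shape of the
completion XML are meant to mirror the patch:

    CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, the S3 minimum part size

    def read_in_chunks(stream, chunk_size=CHUNK_SIZE):
        # Yield successive chunk_size blocks from a file-like object.
        while True:
            data = stream.read(chunk_size)
            if not data:
                break
            yield data

    def completion_xml(etags):
        # Body for step (3): one <Part> entry per (part_number, etag) pair.
        parts = ''.join('<Part><PartNumber>%d</PartNumber><ETag>%s</ETag></Part>'
                        % (num, etag) for num, etag in etags)
        return '<CompleteMultipartUpload>%s</CompleteMultipartUpload>' % parts

    def multipart_upload(stream, upload_part, commit, abort):
        etags = []
        try:
            for part_number, chunk in enumerate(read_in_chunks(stream), 1):
                etag = upload_part(part_number, chunk)      # step (2)
                etags.append((part_number, etag))
            commit(completion_xml(etags))                   # step (3)
        except Exception:
            abort()                                         # step (4)
            raise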
Other changes:
* Did some PEP8 cleanup on s3.py
* s3.get_container() used to iterate through the full list of containers to
find the requested entry. This can be simplified by making a HEAD request (see
the HEAD-request sketch after this list). The only downside is that
'created_time' is not available for the container. Let me know if this
approach is OK or if I should revert it.
* Introduced the following APIs on S3StorageDriver() to make some
functionality easier:
get_container_cdn_url()
get_object_cdn_url()
* In libcloud.common.base.Connection, the request() method is the basis for
all HTTP requests made by libcloud. This method had a limitation which became
apparent in the S3 multipart upload implementation. To initialize an upload,
the API invoked is
/container/object_name?uploads
The 'uploads' parameter has to be passed as-is, without any value. If we used
the 'params' argument of request(), it would come out as 'uploads=***'. To
prevent this, the 'action' is set to /container/object_name?uploads and slight
modifications were made to how parameters are appended (see the urlencode
illustration after this list).
This also forced a change in BaseMockHttpObject._get_method_name().
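To illustrate the request() limitation above: urlencode() always emits
key=value pairs, so a bare marker like 'uploads' has to live in the action
itself, with any remaining parameters appended using '&' rather than '?' (the
extra parameter below is a placeholder):

    try:
        from urllib.parse import urlencode   # Python 3
    except ImportError:
        from urllib import urlencode         # Python 2

    print(urlencode({'uploads': ''}))
    # -> 'uploads=' rather than the bare 'uploads' marker S3 expects

    action = '/container/object_name?uploads'
    extra = {'example': 'value'}             # placeholder additional parameters
    separator = '&' if '?' in action else '?'
    print(action + separator + urlencode(extra))
    # -> /container/object_name?uploads&example=value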
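And for the get_container() change above, a sketch of what the HEAD-based
driver method might look like, assuming libcloud's Container and
ContainerDoesNotExistError types; the exact request path and error handling in
the patch may differ:

    from libcloud.storage.base import Container
    from libcloud.storage.types import ContainerDoesNotExistError

    def get_container(self, container_name):
        # A single HEAD request instead of listing every container.
        response = self.connection.request('/%s' % container_name,
                                           method='HEAD')
        if response.status == 404:
            raise ContainerDoesNotExistError(value=None, driver=self,
                                             container_name=container_name)
        # HEAD does not return the creation time, so 'extra' stays empty.
        return Container(name=container_name, extra={}, driver=self)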
Bug fixes in the test framework:
* While working on the test cases, I noticed a small issue; I am not sure
whether it is a bug or by design.
MockRawResponse._get_response_if_not_availale() would return two different
values on subsequent invocations:
    if not self._response:
        ...
        return self            # <-- this return value was inconsistent
    return self._response
While adding test cases for the Amazon S3 functionality, I noticed that
instead of getting back a MockResponse I was getting a MockRawResponse
instance (which did not have methods like read() or parse_body()), so I fixed
this issue. Because of this, other test cases started failing and were
subsequently fixed. I am not sure whether this needs fixing or whether it was
done on purpose; if someone can shed some light on it, I can work on it
further. As of now, all test cases pass.
* In test_s3.py, the driver was being set everywhere to S3StorageDriver. The
same test cases are reused for GoogleStorageDriver, where the driver then
turned up as S3StorageDriver instead of GoogleStorageDriver. This was fixed by
changing the code to use driver=self.driver_type (see the sketch below).
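A sketch of the resulting pattern, with hypothetical fixture details
(credentials, class names); only the use of a driver_type class attribute
comes from the change described above:

    import unittest

    from libcloud.storage.drivers.s3 import S3StorageDriver
    from libcloud.storage.drivers.google_storage import GoogleStorageDriver

    class S3Tests(unittest.TestCase):
        driver_type = S3StorageDriver

        def setUp(self):
            # Instantiating via self.driver_type lets the subclass below
            # exercise its own driver instead of always getting S3StorageDriver.
            self.driver = self.driver_type('api key', 'api secret')

    class GoogleStorageTests(S3Tests):
        driver_type = GoogleStorageDriver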
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira