GitHub user mahendra opened a pull request:

    https://github.com/apache/libcloud/pull/80

    LIBCLOUD-269 : Multipart upload for amazon S3

    This patch adds support for streaming data upload using Amazon's multipart upload feature, as described in the [S3 docs](http://docs.amazonwebservices.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.html).
    
    As per the current behaviour, the ```upload_object_via_stream()``` API collects the entire object in memory and then uploads it to S3. This becomes problematic with large files (think HD videos of around 4GB), taking a huge toll on the performance and memory usage of the Python application.
    
    With this patch, ```upload_object_via_stream()``` uses the S3 multipart upload feature to upload data in 5MB chunks, reducing the overall memory footprint of the application.
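
    For illustration, the idea behind the chunked read is roughly the following (a minimal sketch; the helper name is mine, not from the patch):
    ```
    CHUNK_SIZE = 5 * 1024 * 1024  # 5MB, the minimum part size S3 allows

    def read_in_chunks(stream, chunk_size=CHUNK_SIZE):
        # Yield the stream in fixed-size pieces so that at most one
        # chunk is held in memory, instead of the whole object.
        while True:
            data = stream.read(chunk_size)
            if not data:
                return
            yield data
    ```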
    
    ## Design of this feature:
    * The ```S3StorageDriver()``` is not used just for Amazon S3. It is subclassed for use with other S3-compliant cloud storage providers like Google Storage.
    * Amazon S3 multipart upload is not (or may not be) supported by those other storage providers, who will prefer the chunked upload mechanism.
    
    We can solve this problem in two ways:
    * Create a new subclass of ```S3StorageDriver``` (say ```AmazonS3StorageDriver```) which implements the new multipart upload mechanism, while other storage providers keep subclassing ```S3StorageDriver```. This is the cleaner approach.
    * Introduce an attribute ```supports_s3_multipart_upload``` and, based on its value, control the callback function passed to the ```_put_object()``` API. This makes the code look a bit hacky, but it is better for supporting such features in the future, since we don't have to keep creating sub-classes for each feature.
    
    In the current patch, I have implemented the latter approach, though I prefer the former. After discussing with the community and learning their preferences, we can settle on a final approach.
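
    Roughly, the second approach looks like this (a simplified sketch; the real ```_put_object()``` signature differs, and ```_stream_data``` stands in here for the plain chunked-upload callback):
    ```
    from libcloud.storage.base import StorageDriver

    class S3StorageDriver(StorageDriver):
        # Providers without S3 multipart support would override this to False.
        supports_s3_multipart_upload = True

        def upload_object_via_stream(self, iterator, container, object_name):
            if self.supports_s3_multipart_upload:
                upload_func = self._upload_multipart
            else:
                upload_func = self._stream_data  # plain chunked upload
            return self._put_object(container=container,
                                    object_name=object_name,
                                    upload_func=upload_func,
                                    iterator=iterator)
    ```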
    
    ## Design notes:
    * The implementation has three steps, plus an abort path for failures:
      1. A ```POST``` request to ```/container/object_name?uploads```. This returns an XML body with a unique ```uploadId```. This is handled as part of ```_put_object()```, which ensures that all S3-related parameters are set correctly.
      2. Each chunk is uploaded via ```PUT``` to ```/container/object_name?partNumber=X&uploadId=***```. This is done by the callback passed to ```_put_object()```, named ```_upload_multipart()```.
      3. An XML document listing the part numbers and the etag headers returned for each part is ```POST```ed to ```/container/object_name?uploadId=***```, implemented via ```_commit_multipart()``` (see the sketch after this list).
      4. On any failure in steps (2) or (3), the upload is deleted from S3 through a ```DELETE``` request to ```/container/object_name?uploadId=****```, implemented via ```_abort_multipart()```.
    
    * The chunk size for upload was set to 5MB, the minimum part size allowed as per the Amazon S3 docs.
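
    As an example of step (3), the XML body listing the parts can be built along these lines (a sketch of what ```_commit_multipart()``` has to produce; the element names come from the S3 docs):
    ```
    from xml.etree import ElementTree as ET

    def build_commit_xml(etags):
        # etags: the ETag header S3 returned for each uploaded part,
        # in part order. S3 uses this body to stitch the parts together.
        root = ET.Element('CompleteMultipartUpload')
        for part_number, etag in enumerate(etags, start=1):
            part = ET.SubElement(root, 'Part')
            ET.SubElement(part, 'PartNumber').text = str(part_number)
            ET.SubElement(part, 'ETag').text = etag
        return ET.tostring(root)
    ```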
    
    ## Other changes:
    * Did some PEP8 cleanup on s3.py
    
    * ```s3.get_container()``` used to iterate through the list of all containers to find the requested entry. This can be simplified by making a ```HEAD``` request. The only downside is that ```created_time``` is not available for the container. Let me know if this approach is OK or if I must revert it.
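
    A sketch of the ```HEAD```-based lookup, written as it would sit on the driver class (simplified; the patch's error handling is more complete):
    ```
    import httplib

    from libcloud.storage.base import Container
    from libcloud.storage.types import ContainerDoesNotExistError

    def get_container(self, container_name):
        # One HEAD request instead of listing every container.
        response = self.connection.request('/%s' % (container_name),
                                           method='HEAD')
        if response.status == httplib.NOT_FOUND:
            raise ContainerDoesNotExistError(value=None, driver=self,
                                             container_name=container_name)
        # created_time is absent from the HEAD response headers,
        # hence it is unavailable with this approach.
        return Container(name=container_name, extra=None, driver=self)
    ```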
    
    * Introduced the following APIs on the ```S3StorageDriver()``` to make some functionality easier:
     * ```get_container_cdn_url()```
     * ```get_object_cdn_url()```
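
    Typical usage would look like this (illustrative only; the bucket, object name and credentials are hypothetical):
    ```
    from libcloud.storage.drivers.s3 import S3StorageDriver

    driver = S3StorageDriver('api key', 'api secret')  # hypothetical credentials
    container = driver.get_container('my-bucket')
    obj = driver.get_object('my-bucket', 'videos/big.mp4')

    print(driver.get_container_cdn_url(container))  # URL of the bucket
    print(driver.get_object_cdn_url(obj))           # URL of the object
    ```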
    
    * In ```libcloud.common.base.Connection```, the ```request()``` method is the basis for all HTTP requests made by libcloud. This method had a limitation which became apparent in the S3 multipart upload implementation. To initialize an upload, the API invoked is ```/container/object_name?uploads```. The ```uploads``` parameter has to be passed as-is, without any value. If we used the ```params``` argument of the ```request()``` method, it would have come out as ```uploads=***```. To prevent this, the ```action``` is set to ```/container/object_name?uploads``` and slight modifications were made to how parameters are appended. This also forced a change in ```BaseMockHttpObject._get_method_name()```.
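
    In other words, a value-less query parameter has to travel inside the ```action``` itself (illustrative; the names below are made up):
    ```
    container_name = 'my-bucket'     # hypothetical, for illustration
    object_name = 'videos/big.mp4'

    # Passing 'uploads' through the params dict would serialize it as
    # 'uploads=...', so the bare token is appended to the action instead.
    action = '/%s/%s?uploads' % (container_name, object_name)
    # response = self.connection.request(action, method='POST')
    ```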
    
    ## Bug fixes in the test framework
    * While working on the test cases, I noticed a small issue. I am not sure whether it was a bug or by design: ```MockRawResponse._get_response_if_not_availale()``` would return two different values on subsequent invocations.
    ```
         if not self._response:
             ...
             return self  # <-- inconsistent: callers get a MockRawResponse here
         return self._response
    ```
      While adding test cases for the Amazon S3 functionality, I noticed that instead of getting back a ```MockResponse```, I was getting a ```MockRawResponse``` instance (which does not have methods like ```read()``` or ```parse_body()```). So I fixed this issue. Because of this, other test cases started failing, and they were subsequently fixed. I am not sure whether this had to be fixed or whether it was done on purpose; if someone can throw some light on it, I can work on it further. As of now, all test cases pass.
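
    A toy illustration of the consistent behaviour after the fix (the real classes carry more state, and ```parse_body()``` is elided):
    ```
    class MockResponse(object):
        def read(self):
            return ''

    class MockRawResponse(object):
        def __init__(self):
            self._response = None

        def _get_response_if_not_availale(self):
            # Build the MockResponse lazily, once, and return the same
            # object on every call, so callers always get something
            # with read() / parse_body().
            if not self._response:
                self._response = MockResponse()
            return self._response
    ```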
    
    * In test_s3.py, the driver was hard-coded everywhere as ```S3StorageDriver```. The same test cases are reused for ```GoogleStorageDriver```, where the driver then shows up as ```S3StorageDriver``` instead of ```GoogleStorageDriver```. This was fixed by changing the code to ```driver=self.driver_type```.
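
    That is, the Google Storage suite reuses the S3 tests by overriding a single class attribute (a sketch of the pattern; the actual suite wires in mock HTTP classes as well):
    ```
    import unittest

    from libcloud.storage.drivers.s3 import S3StorageDriver
    from libcloud.storage.drivers.google_storage import GoogleStorageDriver

    class S3Tests(unittest.TestCase):
        driver_type = S3StorageDriver  # referenced as self.driver_type everywhere

        def setUp(self):
            self.driver = self.driver_type('key', 'secret')  # hypothetical creds

    class GoogleStorageTests(S3Tests):
        # Inherits every test; only the driver class changes.
        driver_type = GoogleStorageDriver
    ```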


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mahendra/libcloud hash

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/libcloud/pull/80.patch

----
commit 94c45ce21938ee8fe974ffbaf3a4da37bc445514
Author: Mahendra M <[email protected]>
Date:   2012-11-18T04:06:59Z

    LIBCLOUD-261 : Iterator based API for container listing

commit 6f2f6a190d361b9c6886e37b7e1164f033c97e50
Author: Mahendra M <[email protected]>
Date:   2012-11-21T11:24:43Z

    LIBCLOUD-265 : In local storage, remove empty parent folders

commit aaa139aa39f7f684e058e124b77e207ff3a55486
Author: Mahendra M <[email protected]>
Date:   2012-11-21T11:31:33Z

    Merge branch 'trunk' into hash
    
    Conflicts:
        libcloud/storage/drivers/cloudfiles.py
        libcloud/storage/drivers/nimbus.py

commit 8ee0b23a9880a03a1cfeb3b8cbda757bd0a9d71f
Author: Mahendra M <[email protected]>
Date:   2012-12-19T13:36:12Z

    Merge branch 'trunk' into hash

commit 6621ecd33853420a28ba80c1ed9a0ec59d3a25fe
Author: Mahendra M <[email protected]>
Date:   2012-12-20T06:23:08Z

    Merge branch 'trunk' into hash

commit 0514c5c175168ff0f66fd3e4b7ff6096ed0fe8b9
Author: Mahendra M <[email protected]>
Date:   2012-12-20T09:47:34Z

    LIBCLOUD-269 : Multipart upload for amazon S3

----
