[ 
https://issues.apache.org/jira/browse/JCLOUDS-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750372#comment-17750372
 ] 

Jan Vermeulen commented on JCLOUDS-1608:
----------------------------------------

I have run and debugged the MpuPartitioningAlgorithmTest, and that leads to the 
following conclusion: MultipartUploadSlicingAlgorithm has been conceived to 
calculate a division into x equal parts plus a remainder, where:
 * there is _always_ a remainder: if the length divides evenly, the number of 
parts is reduced by one and a remainder the size of one full part is returned. 
If the length is equal to or below the default partSize, the number of parts is 
0 and everything is put into the remainder.

 * the maximumNumberOfParts only applies to the x equal parts and _disregards 
the remainder_: if you request a maximum of 100 parts, you get 99 equal parts 
plus a remainder as long as the length does not exceed 
maxPartSize*maxNumberOfParts. But if it does, you get 100 equal parts of 
maxPartSize and a remainder holding what did not fit, i.e. 101 parts in total 
(see the sketch after this list).
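
A minimal standalone sketch of that arithmetic (plain Java, not the actual 
jclouds code; the part size and length are illustrative):

    long partSize = 5L * 1024 * 1024;   // illustrative unit part size (5 MB)
    long length = 10 * partSize;        // a length that divides evenly

    long parts = length / partSize;     // 10
    long remainder = length % partSize; // 0
    if (remainder == 0 && parts > 0) {
        parts -= 1;                     // 9 equal parts...
        remainder = partSize;           // ...plus a remainder of one full part size
    }
    // => even an exact division ends with a remainder: 9 parts + 1 remainder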

The problem is (considering this logic) not with the algorithm itself, but with 
how it is used in combination with GCS, which only accepts 32 parts in total 
(including the remainder). The maximumNumberOfParts defined in 
GoogleCloudStorageBlobStore is 32, which does not guarantee a total of 32 in 
all cases. This is not accounted for anywhere in the blobstore: if there is an 
overflow, the algorithm merely logs it:
    logger.debug(" %d bytes partitioned in %d parts of part size: %d, remaining: %d%s",
            length, parts, chunkSize, remaining,
            remaining > maximumPartSize ? " overflow!" : "");

but apart from that, nothing is done to prevent the blobstore from passing 33 
parts to GCS, which causes the upload to fail.
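
For instance (mirroring the unit test quoted below), slicing a file of 
1.2 GB + 33 bytes with the GCS limits fills all 32 equal parts and still leaves 
a remainder:

    // limits as defined for GoogleCloudStorageBlobStore: 5 MB min, 5 GB max, 32 parts
    MultipartUploadSlicingAlgorithm algorithm =
            new MultipartUploadSlicingAlgorithm(1024 * 1024 * 5, 1024L * 1024 * 1024 * 5, 32);
    algorithm.calculateChunkSize(1024L * 1024 * 1200 + 33);
    // getParts() == 32 and getRemaining() > 0, so 33 parts are uploaded
    // and completeMultipartUpload() fails on GCS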

Since the maxNumberOfParts is externally imposed, my patch handles that 
overflow inside the algorithm, letting the partSize grow (within maxPartSize) 
so that the imposed maxNumberOfParts is honored. The following patched code 
accounts for the logic that there is always a remainder, and handles the case 
where there is an overflow:
 

    long remainder = length % unitPartSize;
    if (remainder == 0 && parts > 0) {
        parts -= 1;
    }

    // SHB patch
    // remainder should be distributed over parts if we are at the maximumNumberOfParts
    // (if not, an additional part is uploaded to GCS thus exceeding the maximum allowed parts)
    if (remainder > 0 && parts == maximumNumberOfParts) {
        parts -= 1;
        partSize = length / parts;
    }
    // end of SHB patch
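
With the patch applied, the same example stays within the limit. A sketch of 
the expected outcome, reusing the assertion from the unit test quoted further 
down:

    MultipartUploadSlicingAlgorithm algorithm =
            new MultipartUploadSlicingAlgorithm(1024 * 1024 * 5, 1024L * 1024 * 1024 * 5, 32);
    algorithm.calculateChunkSize(1024L * 1024 * 1200 + 33);
    // patched: parts drops to 31 and partSize grows to length/31, so the
    // remainder occupies the 32nd and last allowed slot
    assertTrue(algorithm.getParts() + ((algorithm.getRemaining() > 0) ? 1 : 0) <= 32);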

 

Using this corrected patch, there is only one check in the unit test that 
fails, namely the one where the length exceeds maxPartSize*maxNumberOfParts. If 
it is acceptable that the algorithm handles the overflow in this way, the unit 
test would have to be adapted to match the new behavior.

If it is not acceptable for the logic of the algorithm to handle the overflow, 
another solution could be to redefine the maxNumberOfParts in 
GoogleCloudStorageBlobStore as 31, so that the total never exceeds 32 parts 
including the remainder. But that would imply that the overflow in the length 
is not evenly redistributed over the parts.
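
That alternative would be a one-line change in the blobstore; a sketch, 
assuming the limit is the value returned by getMaximumNumberOfParts():

    @Override
    public int getMaximumNumberOfParts() {
        // reserve one of the 32 GCS compose slots for the remainder part
        return 31;
    }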

It's not for me to decide what is best here; I only suggest a solution for a 
problem that (without a patch) prevents us from using the BlobStore 
implementation to upload large files to Google Cloud.

> Slicing of large files can lead to exceed the 32 parts limit of GCS
> -------------------------------------------------------------------
>
>                 Key: JCLOUDS-1608
>                 URL: https://issues.apache.org/jira/browse/JCLOUDS-1608
>             Project: jclouds
>          Issue Type: Bug
>          Components: jclouds-blobstore
>    Affects Versions: 2.2.1, 2.5.0
>            Reporter: Jan Vermeulen
>            Priority: Major
>              Labels: google-cloud-storage
>             Fix For: 2.2.1, 2.6.0
>
>
> MultipartUploadSlicingAlgorithm calculates slices for a large file by first 
> using the defaultPartSize. If the results of that slicing gives parts that 
> don't exceed the min/maxPartSizes (5MB and 5GB for 
> GoogleCloudStorageBlobStore) but that do exceed the maxNumberOfParts (32 for 
> GoogleCloudStorageBlobStore), the algorithm sets the number of parts to 32 and 
> recalculates the size of the parts. If there is any remainder after that, 
> JClouds ends up uploading 33 parts in total to GCS, causing the process to 
> fail in completeMultipartUpload() when recomposing the original content from 
> the parts.
> The following simple unit test proves the case:
> public class AlgoritmTest extends TestCase {
>     public void testSlicing() {
>         MultipartUploadSlicingAlgorithm algorithm =
>                 new MultipartUploadSlicingAlgorithm(1024 * 1024 * 5, 1024L * 1024 * 1024 * 5, 32);
>         algorithm.calculateChunkSize(1024L * 1024 * 1200 + 33);
>         assertTrue(algorithm.getParts() + ((algorithm.getRemaining() > 0) ? 1 : 0) <= 32);
>     }
> }
> It simulates the slicing of a file of 1.2GB+33 bytes (to make sure there is a 
> remainder).
> The following patch fixes the issue:
> ...
> long remainder = length % unitPartSize;
> // SHB patch
> // remainder should be distributed over parts if we are at the maximumNumberOfParts
> // (if not, an additional part is uploaded to GCS thus exceeding the maximum allowed parts)
> // if (remainder == 0 && parts > 0) {
> //     parts -= 1;
> // }
> if (remainder > 0 && parts == maximumNumberOfParts) {
>     parts -= 1;
>     partSize = length / parts;
> }
> // end of SHB patch
> ...
> I also commented out the code that reduces the number of parts when there is 
> no remainder, since that ends up creating a remaining part that is the same 
> size as the others.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
