gaul commented on a change in pull request #89:
URL: https://github.com/apache/jclouds/pull/89#discussion_r533335820



##########
File path: blobstore/src/main/java/org/jclouds/blobstore/internal/BaseBlobStore.java
##########
@@ -354,48 +350,22 @@ protected String putMultipartBlob(String container, Blob blob, PutOptions overri
    protected String putMultipartBlob(String container, Blob blob, PutOptions overrides, ListeningExecutorService executor) {
       ArrayList<ListenableFuture<MultipartPart>> parts = new ArrayList<ListenableFuture<MultipartPart>>();
       MultipartUpload mpu = initiateMultipartUpload(container, blob.getMetadata(), overrides);
-      // Cannot slice InputStream Payload since slice and close mutate the
-      // underlying stream.  Also issue synchronous uploads to avoid buffering
-      // arbitrary amounts of data in-memory.
       Payload payload = blob.getPayload();
-      boolean repeatable = blob.getPayload().isRepeatable();
-      if (!repeatable) {
-         payload = Payloads.newInputStreamPayload(new FilterInputStream((InputStream) payload.getRawContent()) {
-            @Override
-            public long skip(long offset) throws IOException {
-               // intentionally not implemented
-               return offset;
-            }
-
-            @Override
-            public void close() throws IOException {
-               // intentionally not implemented
-            }
-         });
-      }
-
-      try {
-         long contentLength = blob.getMetadata().getContentMetadata().getContentLength();
-         // TODO: inject MultipartUploadSlicingAlgorithm to override default part size
-         MultipartUploadSlicingAlgorithm algorithm = new MultipartUploadSlicingAlgorithm(
-               getMinimumMultipartPartSize(), getMaximumMultipartPartSize(), getMaximumNumberOfParts());
-         long partSize = algorithm.calculateChunkSize(contentLength);
-         int partNumber = 1;
-         while (partNumber <= algorithm.getParts()) {
-            Payload slice = slicer.slice(payload, algorithm.getCopied(), partSize);
+      boolean repeatable = payload.isRepeatable();
+      int partSize = (int) (getMinimumMultipartPartSize() + getMaximumMultipartPartSize()) / 2;
+      int partNumber = 1;
+      int read;
+
+      try (InputStream is = payload.openStream()) {
+         for (byte[] buffer = new byte[partSize]; (read = is.read(buffer)) > -1;) {
+            Payload slice = Payloads.newByteArrayPayload(read == partSize ? buffer : Arrays.copyOf(buffer, read));

Review comment:
       It has been a while since I looked at this code, but I believe that the previous slicing logic could efficiently create parts from repeatable `Payload`s, minimizing memory use.  This code will buffer the entire payload in memory, which won't work when users upload 5 TB objects.
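
For illustration only (a rough sketch, not this PR's code, reusing the `payload`, `slicer`, and `partSize` names from the surrounding diff): with a repeatable payload each part can be an offset-based slice of the source, so memory use stays bounded by the parts currently in flight rather than by the blob size:

```java
// Sketch: split a repeatable Payload into offset-based slices via the existing
// PayloadSlicer#slice(Payload, long, long).  For repeatable sources such as
// files or ByteSources the slice is typically a view, not a copy.
long contentLength = blob.getMetadata().getContentMetadata().getContentLength();
long offset = 0;
for (int partNumber = 1; offset < contentLength; partNumber++) {
   long length = Math.min(partSize, contentLength - offset);
   Payload slice = slicer.slice(payload, offset, length);
   // submit the slice as part partNumber, e.g. via uploadMultipartPart(mpu, partNumber, slice)
   offset += length;
}
```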

##########
File path: blobstore/src/main/java/org/jclouds/blobstore/internal/BaseBlobStore.java
##########
@@ -354,48 +350,22 @@ protected String putMultipartBlob(String container, Blob blob, PutOptions overri
    protected String putMultipartBlob(String container, Blob blob, PutOptions overrides, ListeningExecutorService executor) {
       ArrayList<ListenableFuture<MultipartPart>> parts = new ArrayList<ListenableFuture<MultipartPart>>();
       MultipartUpload mpu = initiateMultipartUpload(container, blob.getMetadata(), overrides);
-      // Cannot slice InputStream Payload since slice and close mutate the
-      // underlying stream.  Also issue synchronous uploads to avoid buffering
-      // arbitrary amounts of data in-memory.
       Payload payload = blob.getPayload();
-      boolean repeatable = blob.getPayload().isRepeatable();
-      if (!repeatable) {
-         payload = Payloads.newInputStreamPayload(new FilterInputStream((InputStream) payload.getRawContent()) {
-            @Override
-            public long skip(long offset) throws IOException {
-               // intentionally not implemented
-               return offset;
-            }
-
-            @Override
-            public void close() throws IOException {
-               // intentionally not implemented
-            }
-         });
-      }
-
-      try {
-         long contentLength = blob.getMetadata().getContentMetadata().getContentLength();
-         // TODO: inject MultipartUploadSlicingAlgorithm to override default part size
-         MultipartUploadSlicingAlgorithm algorithm = new MultipartUploadSlicingAlgorithm(
-               getMinimumMultipartPartSize(), getMaximumMultipartPartSize(), getMaximumNumberOfParts());
-         long partSize = algorithm.calculateChunkSize(contentLength);
-         int partNumber = 1;
-         while (partNumber <= algorithm.getParts()) {
-            Payload slice = slicer.slice(payload, algorithm.getCopied(), partSize);
+      boolean repeatable = payload.isRepeatable();
+      int partSize = (int) (getMinimumMultipartPartSize() + getMaximumMultipartPartSize()) / 2;

Review comment:
       For the most common blobstore, S3, this will calculate (5 MB + 5 GB) / 2 ≈ 2.5 GB.  In addition to overflowing an `int`, this is suboptimal for a lot of cases.  Most users are best served by several 5 MB parts uploading in parallel.  Can you devise a better algorithm that grows the part size as the part number increases, e.g., parts 1-1000 are 5 MB, parts 1001-2000 are 25 MB, etc.?
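
As a rough illustration only (the method name, the 5x growth factor, and the 1000-part step below are made up for this sketch, not part of the PR): growing the part size with the part number keeps small uploads on many cheap 5 MB parts while still letting a 5 TB blob fit under S3's 10,000-part limit:

```java
// Sketch: grow the part size as the part number increases.
// Parts 1-1000 use the minimum size, the next 1000 parts use 5x that, and so
// on, capped at the provider's maximum part size.
static long nextPartSize(int partNumber, long minPartSize, long maxPartSize) {
   long size = minPartSize;
   for (int i = 1000; i < partNumber; i += 1000) {
      size *= 5;   // parts 1-1000: 5 MB, parts 1001-2000: 25 MB, ...
   }
   return Math.min(size, maxPartSize);
}
```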




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

