OliLay opened a new pull request, #41564:
URL: https://github.com/apache/arrow/pull/41564

   ### Rationale for this change
   
   See #40557. The previous implementation always issued multipart uploads, 
which cost three round trips to S3 (`CreateMultipartUpload`, `UploadPart`, 
`CompleteMultipartUpload`) instead of the single round trip of a `PutObject` 
request. 
   
   ### What changes are included in this PR?
   
   Implement logic in the S3 `OutputStream` to use a single `PutObject` request 
if the stream is closed while the written data is still below a certain 
threshold (5 MB). If more data is written, a multipart upload is triggered. 
Note: previously, opening the output stream was already expensive because the 
`CreateMultipartUpload` request was sent at that point. With this change, 
opening the output stream becomes cheap, because we wait until some data has 
been written before deciding which upload method to use. This required some 
additional state-keeping in the output stream class.
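   
   The deferred decision described above can be sketched roughly as follows. 
This is a minimal illustration, not the code in this PR: `FakeS3Client` and 
`DeferredUploadStream` are hypothetical names, the fake client only records 
which requests would be sent, and the exact handling of the 5 MB boundary 
(`>=` here) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an S3 client: it only records request names.
struct FakeS3Client {
  std::vector<std::string> requests;
  void Send(const std::string& name) { requests.push_back(name); }
};

// Sketch of an output stream that picks the upload method lazily.
class DeferredUploadStream {
 public:
  // 5 MB threshold, as mentioned in the description (boundary is assumed).
  static constexpr std::size_t kThreshold = 5 * 1024 * 1024;

  // Opening sends no request at all.
  explicit DeferredUploadStream(FakeS3Client* client) : client_(client) {}

  void Write(const std::string& data) {
    assert(!closed_);
    buffer_ += data;
    // Enough data accumulated: switch to (or continue) a multipart upload.
    if (buffer_.size() >= kThreshold) FlushPart();
  }

  void Close() {
    if (closed_) return;
    closed_ = true;
    if (!multipart_started_) {
      // Small object: one request carries the whole buffer.
      client_->Send("PutObject");
      buffer_.clear();
      return;
    }
    // Upload any remaining bytes as the final part, then finish.
    if (!buffer_.empty()) FlushPart();
    client_->Send("CompleteMultipartUpload");
  }

 private:
  void FlushPart() {
    if (!multipart_started_) {
      client_->Send("CreateMultipartUpload");
      multipart_started_ = true;
    }
    client_->Send("UploadPart");
    buffer_.clear();
  }

  FakeS3Client* client_;
  std::string buffer_;
  bool multipart_started_ = false;
  bool closed_ = false;
};
```

   In this sketch, writing a few bytes and closing records a single 
`PutObject`, while writing past the threshold records 
`CreateMultipartUpload`, one or more `UploadPart` requests, and a final 
`CompleteMultipartUpload`.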
   
   ### Are these changes tested?
   
   No new tests were added, as there are already tests for very small writes 
and very large writes, which exercise both upload paths. Everything 
should therefore be covered by existing tests.
   
   ### Are there any user-facing changes?
   
   - Previously, opening the output stream failed if the object 
already exists. We detected that through the `CreateMultipartUpload` 
request, which is no longer sent when the stream is opened. The failure now 
surfaces when closing the stream, or when writing (once more than 5 MB have 
accumulated). Replicating the old behavior is not possible without sending an 
additional request, which would defeat the purpose of this performance 
optimization. I hope this is fine.
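   
   For callers, the practical consequence is that the result of closing the 
stream has to be checked, because opening no longer reports the error. A 
rough sketch of that shift, using hypothetical `Status`, `FakeStore`, and 
`LazyStream` types rather than Arrow's actual API (the rejection rule in 
`FakeStore` simply mirrors the failure condition described above):

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical, simplified status type (not arrow::Status).
struct Status {
  bool ok;
  std::string message;
};

// Hypothetical store that rejects uploads to keys it already holds.
struct FakeStore {
  std::set<std::string> existing_keys;

  Status Put(const std::string& key) {
    if (existing_keys.count(key) > 0) {
      return {false, "object already exists: " + key};
    }
    existing_keys.insert(key);
    return {true, ""};
  }
};

// Stream that contacts the store only when it is closed.
class LazyStream {
 public:
  LazyStream(FakeStore* store, std::string key)
      : store_(store), key_(std::move(key)) {}

  // Opening sends nothing, so it cannot detect the conflict.
  Status Open() { return {true, ""}; }

  // The first request happens here, so the error surfaces here.
  Status Close() { return store_->Put(key_); }

 private:
  FakeStore* store_;
  std::string key_;
};
```

   With a store that already holds the key, `Open()` succeeds in this sketch 
and `Close()` returns the failure.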
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.