crepererum commented on code in PR #561:
URL: 
https://github.com/apache/arrow-rs-object-store/pull/561#discussion_r2754193269


##########
src/aws/mod.rs:
##########
@@ -101,6 +101,74 @@ impl AmazonS3 {
     fn path_url(&self, path: &Path) -> String {
         self.client.config.path_url(path)
     }
+
+    /// Construct the payloads for a multipart copy operation.
+    fn multipart_copy_payloads<'a>(&self, from: &'a Path, size: u64) -> 
Vec<PutPartPayload<'a>> {
+        let part_size = self.client.config.multipart_copy_part_size;
+        if size <= part_size {
+            return vec![PutPartPayload::Copy(from)];
+        }
+        let mut payloads = Vec::new();
+        let mut offset = 0;
+        while offset < size {
+            let end = if size - offset <= part_size {
+                size
+            } else {
+                offset + part_size
+            };
+            payloads.push(PutPartPayload::CopyRange(from, offset..end));
+            offset = end;
+        }
+        payloads
+    }
+
+    /// Perform a multipart copy operation
+    ///
+    /// If the multipart upload fails, this function makes a best effort 
attempt to clean it up.
+    /// It's the caller's responsibility to add a lifecycle rule if guaranteed 
cleanup is required,
+    /// as we cannot protect against an ill-timed process crash.
+    async fn copy_multipart(
+        &self,
+        from: &Path,
+        to: &Path,
+        size: u64,
+        mode: CompleteMultipartMode,
+    ) -> Result<()> {
+        // Perform multipart copy using UploadPartCopy
+        let upload_id = self
+            .client
+            .create_multipart(to, PutMultipartOptions::default())
+            .await?;
+
+        let mut parts = Vec::new();
+        for (idx, payload) in self

Review Comment:
   Technically we could do perform these requests concurrently (at least up to 
a certain amount), but I think the serial implementation is "good enough" for 
now. It's certainly better than the status quo (i.e. not being able to copy the 
object at all).



##########
src/aws/mod.rs:
##########
@@ -315,13 +374,37 @@ impl ObjectStore for AmazonS3 {
             extensions: _,
         } = options;
 
+        let copy_method = if let Some(limit) = 
self.client.config.multipart_copy_threshold {
+            let size = self
+                .client
+                .get_opts(from, GetOptions::new().with_head(true))
+                .await?
+                .meta
+                .size;
+            if size < limit {
+                CopyMethod::SingleRequest
+            } else {
+                CopyMethod::Multipart(size)
+            }
+        } else {
+            CopyMethod::SingleRequest
+        };
+
         match mode {
             CopyMode::Overwrite => {
-                self.client
-                    .copy_request(from, to)
-                    .idempotent(true)
-                    .send()
-                    .await?;
+                match copy_method {
+                    CopyMethod::SingleRequest => {
+                        self.client
+                            .copy_request(from, to)
+                            .idempotent(true)
+                            .send()
+                            .await?;
+                    }
+                    CopyMethod::Multipart(size) => {
+                        self.copy_multipart(from, to, size, 
CompleteMultipartMode::Overwrite)
+                            .await?;

Review Comment:
   ```suggestion
                           tracing::debug!(%from, %to, "multipart copy");
                           self.copy_multipart(from, to, size, 
CompleteMultipartMode::Overwrite)
                               .await?;
   ```
   
   (you may need to check if `Path` implements `Display` or not)
   
   Reasoning: we are changing the client behavior dynamically based on data 
size and if a server misbehaves due to that, it would be nice if we would at 
least leave some breadcrumbs.



##########
src/aws/builder.rs:
##########
@@ -189,6 +199,11 @@ pub struct AmazonS3Builder {
     request_payer: ConfigValue<bool>,
     /// The [`HttpConnector`] to use
     http_connector: Option<Arc<dyn HttpConnector>>,
+    /// Threshold (bytes) above which copy uses multipart copy. If not set, 
all copies are performed
+    /// as single requests.
+    multipart_copy_threshold: Option<ConfigValue<u64>>,
+    /// Preferred multipart copy part size (bytes). If not set, defaults to 5 
GiB.

Review Comment:
   That's a question I have but that I also think should be included in the 
doc-string: if the object was created using a multi-part upload, are there any 
requirements on the alignment of the copy-parts or is this irrelevant?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to