[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

Sahil Takiar (JIRA) Thu, 10 Nov 2016 12:13:12 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654993#comment-15654993
 ]


Sahil Takiar commented on HADOOP-13600:
---------------------------------------

[[email protected]] I created a Pull Request: 
https://github.com/apache/hadoop/pull/157

Let me know what you think of my approach. I verified that the S3 unit 
tests pass, but I have not run the integration tests yet.

The patch is pretty simple, but it's different from the approach you outlined in 
HIVE-15093. Below are some notes:

* A new method called {{copyFileAsync}} was added, which returns a {{Copy}} 
object. The original method {{copyFile}} is still there, but it just invokes 
{{copyFileAsync(...).waitForCopyResult()}}
* Deletes are done inside the {{ProgressListener}}; I removed the logic in 
{{rename(...)}} that issues bulk delete requests
** I'm assuming the {{ProgressListener}} is invoked by the same thread that is 
issuing the copy request (correct me if I am wrong)
** The drawback is that more calls to S3 are made since delete ops aren't 
grouped together, but the advantage is that deletes are now done across 
multiple threads
*** Let me know if you think this scales. Another benefit of my approach is 
that the logic is much simpler. If we need bulk delete ops then some type of 
intermediate blocking queue may be necessary
* I'm not entirely sure how to make the listing sequential; the API seems to 
suggest you have to sequentially call {{listNextBatchOfObjects(...)}}
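
To make the notes above concrete, here is a minimal sketch of the pattern, not the code in the pull request. It assumes an in-memory map standing in for the S3 bucket, a {{CompletableFuture}} standing in for the SDK's {{Copy}} handle, and a completion callback standing in for the {{ProgressListener}}. Only {{copyFileAsync}} and {{copyFile}} are names from the patch; the class, {{renameDirectory}}, and the other helpers are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Hypothetical illustration only: an in-memory map replaces the real S3 client.
class ParallelRenameSketch {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    private final ExecutorService pool;

    ParallelRenameSketch(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    void put(String key, String data) { store.put(key, data); }

    boolean exists(String key) { return store.containsKey(key); }

    // Starts a copy on the pool and returns a handle the caller may wait on.
    CompletableFuture<Void> copyFileAsync(String src, String dst) {
        return CompletableFuture.runAsync(() -> {
            String data = store.get(src);
            if (data == null) {
                throw new IllegalStateException("missing source: " + src);
            }
            store.put(dst, data);
        }, pool);
    }

    // Blocking variant: delegates to the async method and waits for the result.
    void copyFile(String src, String dst) {
        copyFileAsync(src, dst).join();
    }

    // Launches every copy first, deletes each source once its own copy is done,
    // then waits for all copy-and-delete pairs to finish.
    void renameDirectory(String srcPrefix, String dstPrefix) {
        // Snapshot the keys up front so newly written destinations are not re-listed.
        List<String> keys = store.keySet().stream()
                .filter(k -> k.startsWith(srcPrefix))
                .collect(Collectors.toList());
        List<CompletableFuture<Void>> pending = keys.stream()
                .map(k -> copyFileAsync(k, dstPrefix + k.substring(srcPrefix.length()))
                        // One delete per file, no bulk request. The callback usually
                        // runs on the worker that finished the copy, or on the
                        // calling thread if the copy had already completed.
                        .thenRun(() -> store.remove(k)))
                .collect(Collectors.toList());
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }

    void shutdown() { pool.shutdown(); }

    public static void main(String[] args) {
        ParallelRenameSketch fs = new ParallelRenameSketch(4);
        fs.put("src/a", "1");
        fs.put("src/b", "2");
        fs.renameDirectory("src/", "dst/");
        System.out.println("dst/a exists: " + fs.exists("dst/a"));
        System.out.println("src/a exists: " + fs.exists("src/a"));
        fs.shutdown();
    }
}
```

In this sketch the deletes are spread across the pool threads at the cost of one delete call per file, which is the trade-off described above. Grouping them into bulk requests would need some shared structure, such as a blocking queue, between the copy callbacks and a deleting thread.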

> S3a rename() to copy files in a directory in parallel
> -----------------------------------------------------
>
>                 Key: HADOOP-13600
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13600
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>
> Currently a directory rename does a one-by-one copy, making the request 
> O(files * data). If the copy operations were launched in parallel, the 
> duration of the copy may be reducible to the duration of the longest copy. 
> For a directory with many files, this will be significant.



