[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2019-05-16 Thread Sahil Takiar (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841467#comment-16841467
 ] 

Sahil Takiar commented on HADOOP-13600:
---

No longer working on this, so marking as unassigned.

> S3a rename() to copy files in a directory in parallel
> -----------------------------------------------------
>
> Key: HADOOP-13600
> URL: https://issues.apache.org/jira/browse/HADOOP-13600
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.7.3
>Reporter: Steve Loughran
>Priority: Major
> Attachments: HADOOP-13600.001.patch
>
>
> Currently a directory rename does a one-by-one copy, making the request 
> O(files * data). If the copy operations were launched in parallel, the 
> duration of the copy may be reducible to the duration of the longest copy. 
> For a directory with many files, this will be significant.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2019-03-18 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795343#comment-16795343
 ] 

Hadoop QA commented on HADOOP-13600:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  6s{color} 
| {color:red} HADOOP-13600 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HADOOP-13600 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12886680/HADOOP-13600.001.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-HADOOP-Build/16064/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.






[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2019-03-18 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795337#comment-16795337
 ] 

Steve Loughran commented on HADOOP-13600:
-

I'm reviewing this again. Nominally, the S3 transfer manager is parallelized 
anyway. 

But if a many-GB copy is taking place there, all the small copy operations 
which follow are held up, even though many of them could be executed 
concurrently. So yes, we do need something to do the work in batches.

HADOOP-16189 looks at moving away from the transfer manager and doing it 
ourselves. I'm not yet ready to take that on, but the 200 error of HADOOP-16188 
means I have some doubts now about its longevity. I just don't want to rush 
into that.

* We know rename will never go away; it's too ubiquitous.
* We know that directory renames are a major bottleneck: even "hadoop 
fs -rm" commands suffer, let alone large Hive jobs.
* If we can show a tangible speedup, it's justified.

But: we need to retain consistency with S3Guard in the presence of failure. 
Proposed: after every copy call completes, S3Guard is updated immediately with 
the info about that directory entry existing. We'll update the delete calls after 
every bulk delete.
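The per-copy metadata update proposed here can be sketched with a `CompletionService`, which hands back copies in completion order so each one can be recorded immediately. This is only an illustration: the copy tasks are simulated, and `renameAll`, `metadataStore`, and the key names are hypothetical, not the actual S3AFileSystem code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelRenameSketch {
  // Simulated metadata store: a destination entry is recorded as soon as
  // its copy completes, mirroring the proposed per-copy S3Guard update.
  static final Set<String> metadataStore = ConcurrentHashMap.newKeySet();

  static List<String> renameAll(List<String> srcKeys, String srcPrefix,
                                String dstPrefix, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CompletionService<String> cs = new ExecutorCompletionService<>(pool);
    for (String src : srcKeys) {
      String dst = dstPrefix + src.substring(srcPrefix.length());
      cs.submit(() -> dst);  // stand-in for the actual S3 COPY request
    }
    List<String> copied = new ArrayList<>();
    for (int i = 0; i < srcKeys.size(); i++) {
      String dst = cs.take().get();  // blocks until some copy finishes
      metadataStore.add(dst);        // update metadata immediately, per copy
      copied.add(dst);
    }
    pool.shutdown();
    return copied;
  }

  public static void main(String[] args) throws Exception {
    List<String> out = renameAll(Arrays.asList("src/a", "src/b", "src/c"),
        "src/", "dst/", 2);
    System.out.println(out.size() + " " + metadataStore.contains("dst/a"));
  }
}
```

The design point is that no copy's metadata update waits on the slowest copy; a failure partway through leaves the store consistent with the copies that did complete.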




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2018-07-24 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554959#comment-16554959
 ] 

genericqa commented on HADOOP-13600:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  6s{color} 
| {color:red} HADOOP-13600 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HADOOP-13600 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12886680/HADOOP-13600.001.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-HADOOP-Build/14939/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.






[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2018-01-12 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324733#comment-16324733
 ] 

genericqa commented on HADOOP-13600:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  7s{color} 
| {color:red} HADOOP-13600 does not apply to branch-2. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HADOOP-13600 |
| GITHUB PR | https://github.com/apache/hadoop/pull/167 |
| Console output | 
https://builds.apache.org/job/PreCommit-HADOOP-Build/13964/console |
| Powered by | Apache Yetus 0.7.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.






[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2018-01-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324725#comment-16324725
 ] 

Steve Loughran commented on HADOOP-13600:
-

I've not looked at this for a while; I'll try to take a look in detail, especially 
now that the committer is merged in.

One thing I've realised is that our copy operation isn't doing what we do 
elsewhere: shuffle the list of files so the workload is scattered across shards 
in the bucket, reducing the risk of throttling.

* From the list, grab the first batch to copy (say, the same amount as we can 
delete in a single batch).
* Pick out the few largest files to start copying first.
* Shuffle the rest of the batch.

This is what I've done in 
[cloudup|https://github.com/steveloughran/cloudup/blob/master/src/main/java/org/apache/hadoop/tools/cloudup/Cloudup.java]
 and I believe it makes for a fast upload.
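The batch ordering described in those bullets amounts to a small routine: sort by size, keep the few largest at the front, shuffle the remainder. A minimal sketch, assuming a plain key-to-size map rather than real file statuses; `order`, `largestFirst`, and `seed` are invented names, not cloudup's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class CopyOrdering {
  /**
   * Order a batch of (key, size) entries for copying: the few largest
   * files go first (they dominate total duration), and the remainder is
   * shuffled so requests scatter across bucket shards.
   */
  static List<String> order(Map<String, Long> sizes, int largestFirst, long seed) {
    List<Map.Entry<String, Long>> entries = new ArrayList<>(sizes.entrySet());
    entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));  // biggest first
    int head = Math.min(largestFirst, entries.size());
    List<String> ordered = new ArrayList<>();
    for (int i = 0; i < head; i++) {
      ordered.add(entries.get(i).getKey());       // largest files lead the batch
    }
    List<String> rest = new ArrayList<>();
    for (int i = head; i < entries.size(); i++) {
      rest.add(entries.get(i).getKey());
    }
    Collections.shuffle(rest, new Random(seed));  // scatter the long tail
    ordered.addAll(rest);
    return ordered;
  }
}
```

Starting the largest copies first means the long-running transfers overlap with the many short ones, instead of arriving last and extending the critical path.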




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166384#comment-16166384
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on the issue:

https://github.com/apache/hadoop/pull/157
  
@steveloughran ok that makes sense. Thanks for the explanation. Let me know 
if you need any help with pulling the retry logic out.





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165997#comment-16165997
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on the issue:

https://github.com/apache/hadoop/pull/157
  
1. I don't see why this needs to be blocked waiting for all of the '13786 
patch to go in.
2. But the rename/throttling stuff: yes.

Parallel execution: we've got another transfer manager issuing COPY 
requests, so more HTTPS requests to a single S3 bucket/shard. The more requests 
to a single shard, the more likely you are to hit failures. Now, I think the 
transfer manager does handle the 503 replies which come back, but the rest of 
the FS client doesn't.
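The missing piece named here, retrying throttled calls outside the transfer manager, is essentially exponential backoff. A hedged sketch under stated assumptions: `withRetries` is a hypothetical helper, and any exception is treated as a retryable 503 for illustration, whereas a real client would inspect the SDK exception's status code before retrying.

```java
import java.util.concurrent.Callable;

public class ThrottleRetry {
  /**
   * Retry a call with exponential backoff. In practice only throttling
   * responses (HTTP 503 "Slow Down") should be retried; here every
   * exception is treated as retryable to keep the sketch self-contained.
   */
  static <T> T withRetries(Callable<T> call, int maxAttempts, long baseDelayMs)
      throws Exception {
    Exception last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        last = e;
        Thread.sleep(baseDelayMs << attempt);  // 1x, 2x, 4x, ... the base delay
      }
    }
    throw last;  // attempts exhausted: surface the final failure
  }
}
```

Backing off, rather than retrying immediately, matters precisely because the failure mode is too many concurrent requests to one shard; immediate retries would keep the shard saturated.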






[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165701#comment-16165701
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on the issue:

https://github.com/apache/hadoop/pull/157
  
Updates:
* Moved the parallel rename logic into a dedicated class called 
`ParallelDirectoryRenamer`.
* A few other bug fixes; the core logic remains the same.

@steveloughran your last comment on HADOOP-13786 suggested you might move the 
retry logic out into a separate patch. Are you planning to do that? If not, do 
you think this patch requires waiting for all the work in HADOOP-13786 to be 
completed?

If there are concerns about retry behavior, we could also set the default 
size of the copy thread pool to 1; that way this feature is essentially off 
by default.

Also, what do you mean by "isn't going to be resilient to large copies where 
you are much more likely to hit parallel IO"? What parallel IO are you 
referring to?





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165696#comment-16165696
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138791863
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -241,26 +242,17 @@ public StorageStatistics provide() {
 }
   });
 
-  int maxThreads = conf.getInt(MAX_THREADS, DEFAULT_MAX_THREADS);
-  if (maxThreads < 2) {
-LOG.warn(MAX_THREADS + " must be at least 2: forcing to 2.");
-maxThreads = 2;
-  }
+  int maxThreads = getMaxThreads(conf, MAX_THREADS, 
DEFAULT_MAX_THREADS);
   int totalTasks = intOption(conf,
   MAX_TOTAL_TASKS, DEFAULT_MAX_TOTAL_TASKS, 1);
   long keepAliveTime = longOption(conf, KEEPALIVE_TIME,
   DEFAULT_KEEPALIVE_TIME, 0);
+
--- End diff --

Fixed





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165698#comment-16165698
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138791881
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -303,7 +296,37 @@ public StorageStatistics provide() {
 } catch (AmazonClientException e) {
   throw translateException("initializing ", new Path(name), e);
 }
+  }
+
+  private int getMaxThreads(Configuration conf, String maxThreadsKey, int 
defaultMaxThreads) {
+int maxThreads = conf.getInt(maxThreadsKey, defaultMaxThreads);
+if (maxThreads < 2) {
+  LOG.warn(maxThreadsKey + " must be at least 2: forcing to 2.");
+  maxThreads = 2;
+}
+return maxThreads;
+  }
+
+  private LazyTransferManager 
createLazyUploadTransferManager(Configuration conf) {
--- End diff --

Done





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165697#comment-16165697
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138791871
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -303,7 +296,37 @@ public StorageStatistics provide() {
 } catch (AmazonClientException e) {
   throw translateException("initializing ", new Path(name), e);
 }
+  }
+
+  private int getMaxThreads(Configuration conf, String maxThreadsKey, int 
defaultMaxThreads) {
--- End diff --

Done





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165695#comment-16165695
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138791767
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/CopyContext.java 
---
@@ -0,0 +1,34 @@
+package org.apache.hadoop.fs.s3a;
+
+import com.amazonaws.services.s3.transfer.Copy;
+
+class CopyContext {
+
+  private final Copy copy;
+  private final String srcKey;
+  private final String dstKey;
+  private final long length;
+
+  CopyContext(Copy copy, String srcKey, String dstKey, long length) {
+this.copy = copy;
+this.srcKey = srcKey;
+this.dstKey = dstKey;
+this.length = length;
+  }
+
+  Copy getCopy() {
+return copy;
+  }
+
+  String getSrcKey() {
+return srcKey;
+  }
+
+  String getDstKey() {
--- End diff --

Fixed





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165693#comment-16165693
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138791721
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -891,50 +902,123 @@ private boolean innerRename(Path source, Path dest)
   }
 
   List<DeleteObjectsRequest.KeyVersion> keysToDelete = new ArrayList<>();
+  List<DeleteObjectsRequest.KeyVersion> dirKeysToDelete = new ArrayList<>();
   if (dstStatus != null && dstStatus.isEmptyDirectory() == 
Tristate.TRUE) {
 // delete unnecessary fake directory.
 keysToDelete.add(new DeleteObjectsRequest.KeyVersion(dstKey));
   }
 
-  Path parentPath = keyToPath(srcKey);
-  RemoteIterator<LocatedFileStatus> iterator = listFilesAndEmptyDirectories(
-  parentPath, true);
-  while (iterator.hasNext()) {
-LocatedFileStatus status = iterator.next();
-long length = status.getLen();
-String key = pathToKey(status.getPath());
-if (status.isDirectory() && !key.endsWith("/")) {
-  key += "/";
-}
-keysToDelete
-.add(new DeleteObjectsRequest.KeyVersion(key));
-String newDstKey =
-dstKey + key.substring(srcKey.length());
-copyFile(key, newDstKey, length);
-
-if (hasMetadataStore()) {
-  // with a metadata store, the object entries need to be updated,
-  // including, potentially, the ancestors
-  Path childSrc = keyToQualifiedPath(key);
-  Path childDst = keyToQualifiedPath(newDstKey);
-  if (objectRepresentsDirectory(key, length)) {
-S3Guard.addMoveDir(metadataStore, srcPaths, dstMetas, childSrc,
-childDst, username);
+  // A blocking queue that tracks all objects that need to be deleted
+  BlockingQueue<Optional<DeleteObjectsRequest.KeyVersion>> deleteQueue = new ArrayBlockingQueue<>(
+  (int) Math.round(MAX_ENTRIES_TO_DELETE * 1.5));
+
+  // Used to track if the delete thread was gracefully shutdown
+  boolean deleteFutureComplete = false;
+  FutureTask<Void> deleteFuture = null;
+
+  try {
+// Launch a thread that will read from the deleteQueue and batch 
delete any files that have already been copied
+deleteFuture = new FutureTask<>(() -> {
+  while (true) {
+while (keysToDelete.size() < MAX_ENTRIES_TO_DELETE) {
+  Optional<DeleteObjectsRequest.KeyVersion> key = deleteQueue.take();
+
+  // The thread runs until it is given an EOF message (an Optional#empty())
+  if (key.isPresent()) {
--- End diff --

I removed the usage of `Optional`. Using a `private static final 
DeleteObjectsRequest.KeyVersion END_OF_KEYS_TO_DELETE = new 
DeleteObjectsRequest.KeyVersion(null, null);` as the EOF instead.
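The sentinel approach described here can be illustrated with a self-contained sketch. It uses `String` keys in place of the SDK's `DeleteObjectsRequest.KeyVersion`, and `END_OF_KEYS`/`drainInBatches` are hypothetical names; the point carried over from the patch is that the sentinel is matched by reference, just as a shared `END_OF_KEYS_TO_DELETE` instance would be.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class SentinelQueueSketch {
  // End-of-stream marker, compared by reference. Stands in for the
  // patch's END_OF_KEYS_TO_DELETE (a KeyVersion(null, null) instance).
  static final String END_OF_KEYS = new String("EOF");

  /** Drain keys into fixed-size batches until the sentinel arrives. */
  static List<List<String>> drainInBatches(BlockingQueue<String> queue,
                                           int batchSize) throws InterruptedException {
    List<List<String>> batches = new ArrayList<>();
    List<String> batch = new ArrayList<>();
    while (true) {
      String key = queue.take();
      if (key == END_OF_KEYS) {          // identity check: only the sentinel matches
        if (!batch.isEmpty()) {
          batches.add(batch);            // flush the final partial batch
        }
        return batches;
      }
      batch.add(key);
      if (batch.size() >= batchSize) {   // a full batch: issue the bulk delete here
        batches.add(batch);
        batch = new ArrayList<>();
      }
    }
  }
}
```

Compared with wrapping every element in `Optional`, a shared sentinel instance avoids per-element allocation and keeps the queue's element type the same as the payload type.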





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165694#comment-16165694
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138791756
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -891,50 +902,123 @@ private boolean innerRename(Path source, Path dest)
   }
 
   List<DeleteObjectsRequest.KeyVersion> keysToDelete = new ArrayList<>();
+  List<DeleteObjectsRequest.KeyVersion> dirKeysToDelete = new ArrayList<>();
   if (dstStatus != null && dstStatus.isEmptyDirectory() == 
Tristate.TRUE) {
 // delete unnecessary fake directory.
 keysToDelete.add(new DeleteObjectsRequest.KeyVersion(dstKey));
   }
 
-  Path parentPath = keyToPath(srcKey);
-  RemoteIterator<LocatedFileStatus> iterator = listFilesAndEmptyDirectories(
-  parentPath, true);
-  while (iterator.hasNext()) {
-LocatedFileStatus status = iterator.next();
-long length = status.getLen();
-String key = pathToKey(status.getPath());
-if (status.isDirectory() && !key.endsWith("/")) {
-  key += "/";
-}
-keysToDelete
-.add(new DeleteObjectsRequest.KeyVersion(key));
-String newDstKey =
-dstKey + key.substring(srcKey.length());
-copyFile(key, newDstKey, length);
-
-if (hasMetadataStore()) {
-  // with a metadata store, the object entries need to be updated,
-  // including, potentially, the ancestors
-  Path childSrc = keyToQualifiedPath(key);
-  Path childDst = keyToQualifiedPath(newDstKey);
-  if (objectRepresentsDirectory(key, length)) {
-S3Guard.addMoveDir(metadataStore, srcPaths, dstMetas, childSrc,
-childDst, username);
+  // A blocking queue that tracks all objects that need to be deleted
+  BlockingQueue<Optional<DeleteObjectsRequest.KeyVersion>> deleteQueue = new ArrayBlockingQueue<>(
+  (int) Math.round(MAX_ENTRIES_TO_DELETE * 1.5));
+
+  // Used to track if the delete thread was gracefully shutdown
+  boolean deleteFutureComplete = false;
+  FutureTask<Void> deleteFuture = null;
+
+  try {
+// Launch a thread that will read from the deleteQueue and batch 
delete any files that have already been copied
+deleteFuture = new FutureTask<>(() -> {
+  while (true) {
+while (keysToDelete.size() < MAX_ENTRIES_TO_DELETE) {
+  Optional<DeleteObjectsRequest.KeyVersion> key = deleteQueue.take();
+
+  // The thread runs until it is given an EOF message (an Optional#empty())
+  if (key.isPresent()) {
+keysToDelete.add(key.get());
+  } else {
+
+// Delete any remaining keys and exit
+removeKeys(keysToDelete, true, false);
+return null;
+  }
+}
+removeKeys(keysToDelete, true, false);
+  }
+});
+
+Thread deleteThread = new Thread(deleteFuture);
+deleteThread.setName("s3a-rename-delete-thread");
+deleteThread.start();
+
+// Used to abort future copy tasks as soon as one copy task fails
+AtomicBoolean copyFailure = new AtomicBoolean(false);
+List copies = new ArrayList<>();
+
+Path parentPath = keyToPath(srcKey);
+RemoteIterator<LocatedFileStatus> iterator = listFilesAndEmptyDirectories(
+parentPath, true);
+while (iterator.hasNext()) {
+  LocatedFileStatus status = iterator.next();
+  long length = status.getLen();
+  String key = pathToKey(status.getPath());
+  if (status.isDirectory() && !key.endsWith("/")) {
+key += "/";
+  }
+  if (status.isDirectory()) {
+dirKeysToDelete.add(new DeleteObjectsRequest.KeyVersion(key));
+  }
+  String newDstKey =
+  dstKey + key.substring(srcKey.length());
+
+  // If no previous file hit a copy failure, copy this file
+  if (!copyFailure.get()) {
+copies.add(new CopyContext(copyFileAsync(key, newDstKey,
+new RenameProgressListener(this, srcStatus, 
status.isDirectory() ? null :
+new DeleteObjectsRequest.KeyVersion(key), 
deleteQueue, copyFailure)),
+key, newDstKey, length));
   } else {
-S3Guard.addMoveFile(metadataStore, srcPaths, dstMetas, 
childSrc,
-   

[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165689#comment-16165689
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138791626
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/CopyContext.java 
---
@@ -0,0 +1,34 @@
+package org.apache.hadoop.fs.s3a;
--- End diff --

Whoops, always forget those. Fixed.


> S3a rename() to copy files in a directory in parallel
> -
>
> Key: HADOOP-13600
> URL: https://issues.apache.org/jira/browse/HADOOP-13600
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.7.3
>Reporter: Steve Loughran
>Assignee: Sahil Takiar
> Attachments: HADOOP-13600.001.patch
>
>
> Currently a directory rename does a one-by-one copy, making the request 
> O(files * data). If the copy operations were launched in parallel, the 
> duration of the copy may be reducible to the duration of the longest copy. 
> For a directory with many files, this will be significant.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165690#comment-16165690
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138791635
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/LazyTransferManager.java
 ---
@@ -0,0 +1,63 @@
+package org.apache.hadoop.fs.s3a;
--- End diff --

Fixed





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164662#comment-16164662
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on the issue:

https://github.com/apache/hadoop/pull/157
  
In HADOOP-13786 I'm wrapping every single S3 client call with a retry policy, 
then expanding the inconsistent client to generate more faults (initially 
throttling, later connection setup/response parsing). I'd really like this work 
to wait for that, as without it this code isn't going to be resilient on large 
copies, where you are much more likely to hit parallel IO failures. And we need 
to make sure there's a good failure policy set up there.
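A minimal sketch of what such per-call retry wrapping could look like; the `Invoker` class, `MAX_ATTEMPTS` constant, and retry-on-IOException policy below are illustrative assumptions, not the actual HADOOP-13786 code:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch of a per-call retry wrapper; not the real
// HADOOP-13786 implementation.
public class Invoker {
    private static final int MAX_ATTEMPTS = 3;

    // Retry an idempotent S3 call up to MAX_ATTEMPTS times,
    // rethrowing the last IOException if every attempt fails.
    public static <T> T retry(Callable<T> operation) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e; // transient failure: try again
            } catch (Exception e) {
                throw new IOException(e); // non-IO failures are not retried
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("transient");
            }
            return "ok";
        });
        assert "ok".equals(result);
        assert calls[0] == 3;
    }
}
```

A real policy would also distinguish throttling from other failures and back off between attempts, which is exactly the failure-policy question raised above.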





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164660#comment-16164660
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138620461
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -891,50 +902,123 @@ private boolean innerRename(Path source, Path dest)
   }
 
  List<DeleteObjectsRequest.KeyVersion> keysToDelete = new ArrayList<>();
+  List<DeleteObjectsRequest.KeyVersion> dirKeysToDelete = new ArrayList<>();
   if (dstStatus != null && dstStatus.isEmptyDirectory() == 
Tristate.TRUE) {
 // delete unnecessary fake directory.
 keysToDelete.add(new DeleteObjectsRequest.KeyVersion(dstKey));
   }
 
-  Path parentPath = keyToPath(srcKey);
-  RemoteIterator<LocatedFileStatus> iterator = listFilesAndEmptyDirectories(
-  parentPath, true);
-  while (iterator.hasNext()) {
-LocatedFileStatus status = iterator.next();
-long length = status.getLen();
-String key = pathToKey(status.getPath());
-if (status.isDirectory() && !key.endsWith("/")) {
-  key += "/";
-}
-keysToDelete
-.add(new DeleteObjectsRequest.KeyVersion(key));
-String newDstKey =
-dstKey + key.substring(srcKey.length());
-copyFile(key, newDstKey, length);
-
-if (hasMetadataStore()) {
-  // with a metadata store, the object entries need to be updated,
-  // including, potentially, the ancestors
-  Path childSrc = keyToQualifiedPath(key);
-  Path childDst = keyToQualifiedPath(newDstKey);
-  if (objectRepresentsDirectory(key, length)) {
-S3Guard.addMoveDir(metadataStore, srcPaths, dstMetas, childSrc,
-childDst, username);
+  // A blocking queue that tracks all objects that need to be deleted
+  BlockingQueue<Optional<DeleteObjectsRequest.KeyVersion>> deleteQueue = new ArrayBlockingQueue<>(
+  (int) Math.round(MAX_ENTRIES_TO_DELETE * 1.5));
+
+  // Used to track if the delete thread was gracefully shutdown
+  boolean deleteFutureComplete = false;
+  FutureTask<Void> deleteFuture = null;
+
+  try {
+// Launch a thread that will read from the deleteQueue and batch 
delete any files that have already been copied
+deleteFuture = new FutureTask<>(() -> {
+  while (true) {
+while (keysToDelete.size() < MAX_ENTRIES_TO_DELETE) {
+  Optional<DeleteObjectsRequest.KeyVersion> key = deleteQueue.take();
+
+  // The thread runs until it is given an EOF message (an Optional#empty())
+  if (key.isPresent()) {
+keysToDelete.add(key.get());
+  } else {
+
+// Delete any remaining keys and exit
+removeKeys(keysToDelete, true, false);
+return null;
+  }
+}
+removeKeys(keysToDelete, true, false);
+  }
+});
+
+Thread deleteThread = new Thread(deleteFuture);
+deleteThread.setName("s3a-rename-delete-thread");
+deleteThread.start();
+
+// Used to abort future copy tasks as soon as one copy task fails
+AtomicBoolean copyFailure = new AtomicBoolean(false);
+List<CopyContext> copies = new ArrayList<>();
+
+Path parentPath = keyToPath(srcKey);
+RemoteIterator<LocatedFileStatus> iterator = listFilesAndEmptyDirectories(
+parentPath, true);
+while (iterator.hasNext()) {
+  LocatedFileStatus status = iterator.next();
+  long length = status.getLen();
+  String key = pathToKey(status.getPath());
+  if (status.isDirectory() && !key.endsWith("/")) {
+key += "/";
+  }
+  if (status.isDirectory()) {
+dirKeysToDelete.add(new DeleteObjectsRequest.KeyVersion(key));
+  }
+  String newDstKey =
+  dstKey + key.substring(srcKey.length());
+
+  // If no previous file hit a copy failure, copy this file
+  if (!copyFailure.get()) {
+copies.add(new CopyContext(copyFileAsync(key, newDstKey,
+new RenameProgressListener(this, srcStatus, 
status.isDirectory() ? null :
+new DeleteObjectsRequest.KeyVersion(key), 
deleteQueue, copyFailure)),
+key, newDstKey, length));
   } else {
-S3Guard.addMoveFile(metadataStore, srcPaths, dstMetas, 
childSrc,
- 

[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164658#comment-16164658
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138620149
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -303,7 +296,37 @@ public StorageStatistics provide() {
 } catch (AmazonClientException e) {
   throw translateException("initializing ", new Path(name), e);
 }
+  }
+
+  private int getMaxThreads(Configuration conf, String maxThreadsKey, int 
defaultMaxThreads) {
+int maxThreads = conf.getInt(maxThreadsKey, defaultMaxThreads);
+if (maxThreads < 2) {
+  LOG.warn(maxThreadsKey + " must be at least 2: forcing to 2.");
+  maxThreads = 2;
+}
+return maxThreads;
+  }
+
+  private LazyTransferManager 
createLazyUploadTransferManager(Configuration conf) {
--- End diff --

this logic could be moved into LazyUploadTransferManager itself, maybe just 
as a static method





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164656#comment-16164656
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138619558
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -241,26 +242,17 @@ public StorageStatistics provide() {
 }
   });
 
-  int maxThreads = conf.getInt(MAX_THREADS, DEFAULT_MAX_THREADS);
-  if (maxThreads < 2) {
-LOG.warn(MAX_THREADS + " must be at least 2: forcing to 2.");
-maxThreads = 2;
-  }
+  int maxThreads = getMaxThreads(conf, MAX_THREADS, 
DEFAULT_MAX_THREADS);
   int totalTasks = intOption(conf,
   MAX_TOTAL_TASKS, DEFAULT_MAX_TOTAL_TASKS, 1);
   long keepAliveTime = longOption(conf, KEEPALIVE_TIME,
   DEFAULT_KEEPALIVE_TIME, 0);
+
--- End diff --

ideally, keep superfluous edits to a minimum. I know, we all abuse that... but 
it's good to try





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164657#comment-16164657
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138619796
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -303,7 +296,37 @@ public StorageStatistics provide() {
 } catch (AmazonClientException e) {
   throw translateException("initializing ", new Path(name), e);
 }
+  }
+
+  private int getMaxThreads(Configuration conf, String maxThreadsKey, int 
defaultMaxThreads) {
--- End diff --

move to S3AUtils





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164655#comment-16164655
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138619429
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -241,26 +242,17 @@ public StorageStatistics provide() {
 }
   });
 
-  int maxThreads = conf.getInt(MAX_THREADS, DEFAULT_MAX_THREADS);
-  if (maxThreads < 2) {
-LOG.warn(MAX_THREADS + " must be at least 2: forcing to 2.");
-maxThreads = 2;
-  }
+  int maxThreads = getMaxThreads(conf, MAX_THREADS, 
DEFAULT_MAX_THREADS);
--- End diff --

I'm assuming the checks are here for a reason. Unless lazy xfer manager 
does the uprate, it'll need reinstatement. Doing it in the manager would be best





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164651#comment-16164651
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138618823
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/CopyContext.java 
---
@@ -0,0 +1,34 @@
+package org.apache.hadoop.fs.s3a;
+
+import com.amazonaws.services.s3.transfer.Copy;
+
+class CopyContext {
+
+  private final Copy copy;
+  private final String srcKey;
+  private final String dstKey;
+  private final long length;
+
+  CopyContext(Copy copy, String srcKey, String dstKey, long length) {
+this.copy = copy;
+this.srcKey = srcKey;
+this.dstKey = dstKey;
+this.length = length;
+  }
+
+  Copy getCopy() {
+return copy;
+  }
+
+  String getSrcKey() {
+return srcKey;
+  }
+
+  String getDstKey() {
--- End diff --

no, go on, use "destKey". We can afford the letter e's





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164647#comment-16164647
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r138618577
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -891,50 +902,123 @@ private boolean innerRename(Path source, Path dest)
   }
 
  List<DeleteObjectsRequest.KeyVersion> keysToDelete = new ArrayList<>();
+  List<DeleteObjectsRequest.KeyVersion> dirKeysToDelete = new ArrayList<>();
   if (dstStatus != null && dstStatus.isEmptyDirectory() == 
Tristate.TRUE) {
 // delete unnecessary fake directory.
 keysToDelete.add(new DeleteObjectsRequest.KeyVersion(dstKey));
   }
 
-  Path parentPath = keyToPath(srcKey);
-  RemoteIterator<LocatedFileStatus> iterator = listFilesAndEmptyDirectories(
-  parentPath, true);
-  while (iterator.hasNext()) {
-LocatedFileStatus status = iterator.next();
-long length = status.getLen();
-String key = pathToKey(status.getPath());
-if (status.isDirectory() && !key.endsWith("/")) {
-  key += "/";
-}
-keysToDelete
-.add(new DeleteObjectsRequest.KeyVersion(key));
-String newDstKey =
-dstKey + key.substring(srcKey.length());
-copyFile(key, newDstKey, length);
-
-if (hasMetadataStore()) {
-  // with a metadata store, the object entries need to be updated,
-  // including, potentially, the ancestors
-  Path childSrc = keyToQualifiedPath(key);
-  Path childDst = keyToQualifiedPath(newDstKey);
-  if (objectRepresentsDirectory(key, length)) {
-S3Guard.addMoveDir(metadataStore, srcPaths, dstMetas, childSrc,
-childDst, username);
+  // A blocking queue that tracks all objects that need to be deleted
+  BlockingQueue<Optional<DeleteObjectsRequest.KeyVersion>> deleteQueue = new ArrayBlockingQueue<>(
+  (int) Math.round(MAX_ENTRIES_TO_DELETE * 1.5));
+
+  // Used to track if the delete thread was gracefully shutdown
+  boolean deleteFutureComplete = false;
+  FutureTask<Void> deleteFuture = null;
+
+  try {
+// Launch a thread that will read from the deleteQueue and batch 
delete any files that have already been copied
+deleteFuture = new FutureTask<>(() -> {
+  while (true) {
+while (keysToDelete.size() < MAX_ENTRIES_TO_DELETE) {
+  Optional<DeleteObjectsRequest.KeyVersion> key = deleteQueue.take();
+
+  // The thread runs until it is given an EOF message (an Optional#empty())
+  if (key.isPresent()) {
--- End diff --

I'm making the leap to Java 8 with the committer and retry logic. Doesn't 
mean we should jump to using Optional just because we can. Put differently: if 
you are going to just use it as a fancy null, it's overkill. Key benefit is 
that you can use map & foreach, but there we are crippled by Java's checked 
exceptions stopping us throwing IOEs.
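The checked-exception constraint can be seen in a small standalone example: `Optional#map` takes a `java.util.function.Function`, which declares no checked exceptions, so an I/O-throwing mapper has to smuggle its `IOException` out unchecked. This is a generic Java illustration, not code from the patch:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;

// Generic illustration of Optional#map vs checked exceptions.
public class OptionalIoe {
    static String load(String key) throws IOException {
        if (key.isEmpty()) {
            throw new IOException("empty key");
        }
        return "value-of-" + key;
    }

    static Optional<String> lookup(Optional<String> key) throws IOException {
        try {
            // The lambda cannot throw IOException directly; wrap and re-throw.
            return key.map(k -> {
                try {
                    return load(k);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (UncheckedIOException e) {
            throw e.getCause(); // restore the checked exception for callers
        }
    }

    public static void main(String[] args) throws IOException {
        assert lookup(Optional.of("a")).get().equals("value-of-a");
        assert !lookup(Optional.empty()).isPresent();
    }
}
```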





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163364#comment-16163364
 ] 

Hadoop QA commented on HADOOP-13600:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m 11s{color} 
| {color:red} HADOOP-13600 does not apply to branch-2. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HADOOP-13600 |
| GITHUB PR | https://github.com/apache/hadoop/pull/167 |
| Console output | 
https://builds.apache.org/job/PreCommit-HADOOP-Build/13264/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.






[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-12 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163348#comment-16163348
 ] 

Sahil Takiar commented on HADOOP-13600:
---

[~ste...@apache.org] thanks for taking a look!

* Thanks for the tip on the S3Guard testing; I re-ran the tests with the local 
DynamoDB enabled and they pass
* I made some modifications so that all directories are deleted after the 
files, and they are deleted in the order returned by 
{{#listFilesAndEmptyDirectories}}
* I moved some logic out into separate classes to decrease the line count
* I created a new class called {{LazyTransferManager}} that lazily initializes 
the {{TransferManager}}




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-09 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160040#comment-16160040
 ] 

Steve Loughran commented on HADOOP-13600:
-

Just had a quick look at things other than the core algorithm, which will take 
more time to review than I have now

* With s3guard we want to make sure that parent things are only deleted after 
all the children; that may imply a serialized bit at the end.
* we're permanently fighting file size in S3AFilesystem, which is at 3K lines 
today; anything which can be done to organise stuff out of the class (progress, 
common thread pool setup) helps.
* And resource use and startup time of s3a is an issue on things like shared 
Hive daemons, where an s3a instance for a single user might be instantiated for 
a few reads and then deleted... we shouldn't be creating heavier things until 
needed.

Maybe here we could have a transfer manager wrapper which implemented on-demand 
creation, reading in the config options

{code}
XferManager transfers = new XferManager(conf, TRANSFERS, TRANSFERS_MAX)
XferManager upload = new XferManager(conf, UPLOADS, UPLOADS_MAX)

// and with the .get() method doing on-demand instantiation of things including 
thread pools as needed.
transfers.get().startTransfer(...)
{code}
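The on-demand idea can be sketched generically with a memoizing wrapper. The `Lazy` class below illustrates the pattern only; it is not the proposed `XferManager` API, and the real wrapper would read its thread-pool options from the `Configuration` and build an AWS `TransferManager` instead of a plain `Supplier` result:

```java
import java.util.function.Supplier;

// Illustrative on-demand wrapper: the wrapped resource is built on
// first get() and cached thereafter.
public class Lazy<T> {
    private final Supplier<T> factory;
    private T instance;    // built on first get()
    private int creations; // exposed for the demo only

    public Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    // Create the wrapped resource on first use, then cache it.
    public synchronized T get() {
        if (instance == null) {
            creations++;
            instance = factory.get();
        }
        return instance;
    }

    synchronized int creationCount() {
        return creations;
    }

    public static void main(String[] args) {
        Lazy<String> transfers = new Lazy<>(() -> "transfer-manager");
        assert transfers.creationCount() == 0;     // nothing built yet
        assert transfers.get().equals("transfer-manager");
        assert transfers.get() == transfers.get(); // same cached instance
        assert transfers.creationCount() == 1;     // built exactly once
    }
}
```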




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160038#comment-16160038
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r137931682
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -2167,7 +2280,34 @@ public String getCanonicalServiceName() {
* @throws IOException Other IO problems
*/
   private void copyFile(String srcKey, String dstKey, long size)
-  throws IOException, InterruptedIOException, AmazonClientException {
+  throws IOException, AmazonClientException {
--- End diff --

leave as before so as to make clear it may be interrupted, unless it's 
mentioned in the javadocs





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160037#comment-16160037
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r137931623
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -143,9 +150,11 @@
   private Listing listing;
   private long partSize;
   private boolean enableMultiObjectsDelete;
-  private TransferManager transfers;
+  private TransferManager uploads;
--- End diff --

leave the name of this alone unless really, really needed because (a) we 
may do other things with it in future, (b) reduces the size of the diff hence 
compatibility with other patches.





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160033#comment-16160033
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r137931556
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -896,45 +941,108 @@ private boolean innerRename(Path source, Path dest)
 keysToDelete.add(new DeleteObjectsRequest.KeyVersion(dstKey));
   }
 
-  Path parentPath = keyToPath(srcKey);
-  RemoteIterator<LocatedFileStatus> iterator = listFilesAndEmptyDirectories(
-  parentPath, true);
-  while (iterator.hasNext()) {
-LocatedFileStatus status = iterator.next();
-long length = status.getLen();
-String key = pathToKey(status.getPath());
-if (status.isDirectory() && !key.endsWith("/")) {
-  key += "/";
-}
-keysToDelete
-.add(new DeleteObjectsRequest.KeyVersion(key));
-String newDstKey =
-dstKey + key.substring(srcKey.length());
-copyFile(key, newDstKey, length);
-
-if (hasMetadataStore()) {
-  // with a metadata store, the object entries need to be updated,
-  // including, potentially, the ancestors
-  Path childSrc = keyToQualifiedPath(key);
-  Path childDst = keyToQualifiedPath(newDstKey);
-  if (objectRepresentsDirectory(key, length)) {
-S3Guard.addMoveDir(metadataStore, srcPaths, dstMetas, childSrc,
-childDst, username);
+  // A blocking queue that tracks all objects that need to be deleted
+  BlockingQueue<Optional<DeleteObjectsRequest.KeyVersion>> deleteQueue = new ArrayBlockingQueue<>(
+  MAX_ENTRIES_TO_DELETE * 1.5);
+
+  // Used to track if the delete thread was gracefully shutdown
+  boolean deleteFutureComplete = false;
+  FutureTask deleteFuture = null;
+
+  try {
+// Launch a thread that will read from the deleteQueue and batch 
delete any files that have already been copied
+deleteFuture = new FutureTask<>(new Callable() {
--- End diff --

If we target java 8 only, this can be a proper lambda-expression :)
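On Java 8 this does indeed collapse into a lambda. A minimal JDK-only sketch of the pattern (class and method names invented here; the real worker drains a delete queue rather than returning a constant):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;

public class LambdaDeleteTask {

    // Stand-in for the batch-delete worker: a Callable expressed as a
    // Java 8 lambda instead of an anonymous inner class.
    static FutureTask<Integer> newDeleteFuture(int keyCount) {
        Callable<Integer> worker = () -> keyCount;  // real code: drain queue, issue deletes
        return new FutureTask<>(worker);
    }

    public static void main(String[] args) throws Exception {
        FutureTask<Integer> deleteFuture = newDeleteFuture(5);
        new Thread(deleteFuture).start();
        System.out.println("deleted " + deleteFuture.get() + " keys");
    }
}
```

The lambda form only works if the target branch builds with Java 8, which was the open question at the time.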


> S3a rename() to copy files in a directory in parallel
> -
>
> Key: HADOOP-13600
> URL: https://issues.apache.org/jira/browse/HADOOP-13600
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.7.3
>Reporter: Steve Loughran
>Assignee: Sahil Takiar
>
> Currently a directory rename does a one-by-one copy, making the request 
> O(files * data). If the copy operations were launched in parallel, the 
> duration of the copy may be reducible to the duration of the longest copy. 
> For a directory with many files, this will be significant






[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160030#comment-16160030
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r137931539
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -306,6 +311,58 @@ public StorageStatistics provide() {
 
   }
 
+  private TransferManager createUploadTransferManager(Configuration conf) {
+int maxThreads = conf.getInt(UPLOAD_MAX_THREADS, 
DEFAULT_UPLOAD_MAX_THREADS);
+if (maxThreads < 2) {
+  LOG.warn(UPLOAD_MAX_THREADS + " must be at least 2: forcing to 2.");
+  maxThreads = 2;
+}
+long keepAliveTime = longOption(conf, KEEPALIVE_TIME,
+DEFAULT_KEEPALIVE_TIME, 0);
+
+uploadUnboundedThreadPool = new ThreadPoolExecutor(
+maxThreads, Integer.MAX_VALUE,
+keepAliveTime, TimeUnit.SECONDS,
new LinkedBlockingQueue<Runnable>(),
+BlockingThreadPoolExecutorService.newDaemonThreadFactory(
+"s3a-upload-unbounded"));
+
+TransferManagerConfiguration transferConfiguration =
+new TransferManagerConfiguration();
+transferConfiguration.setMinimumUploadPartSize(partSize);
+transferConfiguration.setMultipartUploadThreshold(multiPartThreshold);
+
+TransferManager uploads = new TransferManager(s3, 
uploadUnboundedThreadPool);
+uploads.setConfiguration(transferConfiguration);
+return uploads;
+  }
+
+  private TransferManager createCopyTransferManager(Configuration conf) {
--- End diff --

this is almost the same as the previous one apart from the conf fields and 
names; should be a common method
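A hedged sketch of the suggested extraction, using only JDK types (in the patch the executor would back an AWS {{TransferManager}}; the key names and defaults here are placeholders):

```java
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TransferPoolFactory {

    // One shared helper replacing createUploadTransferManager /
    // createCopyTransferManager: only the config key, default, and
    // thread-name prefix differ between the two call sites.
    static ThreadPoolExecutor createUnboundedPool(Map<String, String> conf,
            String maxThreadsKey, int defaultMaxThreads, String poolName) {
        int maxThreads = Integer.parseInt(
                conf.getOrDefault(maxThreadsKey, Integer.toString(defaultMaxThreads)));
        if (maxThreads < 2) {
            maxThreads = 2;  // same floor the patch enforces with a warning
        }
        return new ThreadPoolExecutor(
                maxThreads, Integer.MAX_VALUE,
                60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>(),
                r -> {
                    Thread t = new Thread(r, poolName);
                    t.setDaemon(true);  // daemon threads, as in the patch's factory
                    return t;
                });
    }
}
```

Each caller would then pass its own key/default/name triple and wrap the returned executor in its {{TransferManager}}.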





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160028#comment-16160028
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r137931515
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -269,7 +273,8 @@ public StorageStatistics provide() {
   }
   useListV1 = (listVersion == 1);
 
-  initTransferManager();
+  this.uploads = createUploadTransferManager(conf);
+  this.copies = createCopyTransferManager(conf);
--- End diff --

I'd actually like this to be on demand. That way, if not needed, no 
overhead of creating it. S3A init times are already high
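On-demand creation could be a lazily initialized field guarded by double-checked locking; a JDK-only sketch with a generic holder standing in for the real {{TransferManager}} field (names invented here):

```java
import java.util.function.Supplier;

public class LazyHolder<T> {

    private final Supplier<T> factory;  // e.g. () -> createCopyTransferManager(conf)
    private volatile T instance;

    public LazyHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    // Built only on first use, so filesystem initialize() stays cheap;
    // double-checked locking on the volatile field keeps the common
    // already-initialized path lock-free.
    public T get() {
        T result = instance;
        if (result == null) {
            synchronized (this) {
                if (instance == null) {
                    instance = factory.get();
                }
                result = instance;
            }
        }
        return result;
    }
}
```

rename() would then call {{copies.get()}} the first time it needs a copy transfer manager, and a filesystem that never renames never pays the construction cost.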





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160027#comment-16160027
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r137931483
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java 
---
@@ -107,11 +107,15 @@ private Constants() {
   public static final String MAX_PAGING_KEYS = "fs.s3a.paging.maximum";
   public static final int DEFAULT_MAX_PAGING_KEYS = 5000;
 
-  // the maximum number of threads to allow in the pool used by 
TransferManager
-  public static final String MAX_THREADS = "fs.s3a.threads.max";
-  public static final int DEFAULT_MAX_THREADS = 10;
+  // the maximum number of threads to allow in the pool used by 
TransferManager for uploads
+  public static final String UPLOAD_MAX_THREADS = "fs.s3a.threads.max";
+  public static final int DEFAULT_UPLOAD_MAX_THREADS = 10;
--- End diff --

not going to rename constants because it will break anyone using them. 
Class is tagged {@InterfaceAudience.Public/evolving}





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-09 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160024#comment-16160024
 ] 

Steve Loughran commented on HADOOP-13600:
-

you don't need a DynamoDB account for the S3Guard tests; just run with 
{{-Ds3guard -Ddynamodblocal}}

modifying that existing scale test would be the way to go for parallelism




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2017-09-08 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16159377#comment-16159377
 ] 

Sahil Takiar commented on HADOOP-13600:
---

Rebased this patch and made some bug fixes:
* Details of how the code works are in the PR: 
https://github.com/apache/hadoop/pull/157
* Ran the existing tests (except for the S3Guard ITests, because I need to get 
DynamoDB access)
** Ran the unit tests, itests, and scale tests; they all pass
** There already seems to be an existing scale test that stresses this part of 
the code: {{ITestS3ADeleteManyFiles#testBulkRenameAndDelete}} creates a bunch of 
files and then renames them
* I haven't written other ITests because I wanted to get some input on whether 
additional ones are necessary
* Planning to add some unit tests (using lots of mocks) once I get a thumbs up 
on the overall approach

[~fabbri], [~mackrorysd] could you take a look?




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-12-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15754712#comment-15754712
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user thodemoor commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r92831039
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java 
---
@@ -99,17 +99,25 @@ private Constants() {
   public static final String MAX_PAGING_KEYS = "fs.s3a.paging.maximum";
   public static final int DEFAULT_MAX_PAGING_KEYS = 5000;
 
-  // the maximum number of threads to allow in the pool used by 
TransferManager
-  public static final String MAX_THREADS = "fs.s3a.threads.max";
-  public static final int DEFAULT_MAX_THREADS = 10;
+  // the maximum number of threads to allow in the pool used by 
TransferManager for uploads
+  public static final String UPLOAD_MAX_THREADS = "fs.s3a.threads.max";
+  public static final int UPLOAD_DEFAULT_MAX_THREADS = 10;
+
+  // the maximum number of threads to allow in the pool used by 
TransferManager for copies
+  public static final String COPY_MAX_THREADS = "fs.s3a.threads.max";
--- End diff --

fs.s3a.copy.threads.max?





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-12-15 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752159#comment-15752159
 ] 

Sahil Takiar commented on HADOOP-13600:
---

[~ste...@apache.org], [~thodemoor]

Do you both have some more comments on the PR? 
https://github.com/apache/hadoop/pull/157

* What do you think of the approach I am taking? Anything missing?
* I started writing unit tests for {{S3AFileSystem.rename}}, which turned out 
to be much more difficult than I thought, but I was able to get a basic unit 
test completed. Planning to add some more testing, but wanted to make sure the 
approach was correct first.




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752152#comment-15752152
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user sahilTakiar commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r92673062
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -207,11 +218,17 @@ public StorageStatistics provide() {
   MAX_TOTAL_TASKS, DEFAULT_MAX_TOTAL_TASKS, 1);
   long keepAliveTime = longOption(conf, KEEPALIVE_TIME,
   DEFAULT_KEEPALIVE_TIME, 0);
-  threadPoolExecutor = BlockingThreadPoolExecutorService.newInstance(
+  uploadThreadPoolExecutor = 
BlockingThreadPoolExecutorService.newInstance(
   maxThreads,
   maxThreads + totalTasks,
   keepAliveTime, TimeUnit.SECONDS,
-  "s3a-transfer-shared");
+  "s3a-upload-shared");
+
+  copyThreadPoolExecutor = 
BlockingThreadPoolExecutorService.newInstance(
+  maxThreads,
--- End diff --

How about the core pool size? I've updated both for now. Wasn't sure what 
reasonable defaults would be, so I just put down my best guess.





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-12-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15743379#comment-15743379
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user thodemoor commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r92054836
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -207,11 +218,17 @@ public StorageStatistics provide() {
   MAX_TOTAL_TASKS, DEFAULT_MAX_TOTAL_TASKS, 1);
   long keepAliveTime = longOption(conf, KEEPALIVE_TIME,
   DEFAULT_KEEPALIVE_TIME, 0);
-  threadPoolExecutor = BlockingThreadPoolExecutorService.newInstance(
+  uploadThreadPoolExecutor = 
BlockingThreadPoolExecutorService.newInstance(
   maxThreads,
   maxThreads + totalTasks,
   keepAliveTime, TimeUnit.SECONDS,
-  "s3a-transfer-shared");
+  "s3a-upload-shared");
+
+  copyThreadPoolExecutor = 
BlockingThreadPoolExecutorService.newInstance(
+  maxThreads,
--- End diff --

COPY is server-side (no data transfer) and is thus generally much less 
resource-intensive and much quicker than PUT (the smaller your bandwidth to S3, 
the bigger the difference becomes). So I think the `maxThreads` for the 
copyThreadpool could be (much) higher than for uploadThreadpool and should thus 
be configurable separately.





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15708259#comment-15708259
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran closed the pull request at:

https://github.com/apache/hadoop/pull/167





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-29 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707670#comment-15707670
 ] 

Sahil Takiar commented on HADOOP-13600:
---

Sounds good Steve. I rebased the PR and created a separate TransferManager just 
for copies. Haven't looked at the testing yet.




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706370#comment-15706370
 ] 

Steve Loughran commented on HADOOP-13600:
-

Sahil, unless you object, I'm assigning this to you; I'll help with review & 
test. I do like the idea of fault injection via a new s3 client though.

BTW, filed HADOOP-13846; implementing the {{FileSystem.rename(final Path src, 
final Path dst, final Rename... options)}} method. That can come after; I'm 
just thinking it may be the way to expose exception raising in rename() 
operations.




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15705228#comment-15705228
 ] 

Steve Loughran commented on HADOOP-13600:
-

yeah, looking at my patch I'm not sure if there's much of it worth preserving. 
I'd realised it had become overly complex. How about you sync your patch up with 
HADOOP-13823 and we start from there. Elsewhere (HADOOP-13786) I'm doing a 
special committer for s3, which is going to directly call innerRename() & so 
relay failures up immediately.

Failures on renames are going to be fun:
#  I can think of a test for that but it will be fairly brittle: create two 
large files, have a thread size of 1 for the copies, delete the 2nd while the 
first copy is in progress but after the listing has started. Or: start playing 
with mocks via the plug-in AWS client code; have something which explicitly 
triggers failures on copy requests.
# I do want at least the first copy failure to be propagated up to the caller; 
usually it is that first failure which is most interesting.




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-28 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704351#comment-15704351
 ] 

Sahil Takiar commented on HADOOP-13600:
---

Will take a look at HADOOP-13823

Addressed a few of the comments:

* Using a {{BlockingQueue}} to track keys that need to be deleted
* A separate thread takes from the queue until it has taken 
{{MAX_ENTRIES_TO_DELETE}} keys, and then it issues the DELETE request
* An {{AtomicBoolean}} is passed into the {{ProgressListener}} of the COPY 
request; if the COPY fails, the boolean is set to false, in which case no more 
COPY requests will be issued
* [~ste...@apache.org] I took a look at your PR, is it necessary to have a 
threadpool where each thread calls {{Copy.waitForCopyResult()}}; would it be 
simpler to create a separate {{TransferManager}} just for COPY requests
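The queue/consumer scheme described above can be sketched with JDK types only (names are illustrative, and the real consumer issues an S3 batch DELETE instead of just counting batches):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class BatchDeleter implements Callable<Integer> {

    static final int MAX_ENTRIES_TO_DELETE = 3;  // real S3 limit is 1000 keys
    static final String POISON = "";             // sentinel: no more keys coming

    private final BlockingQueue<String> queue;   // fed by copy-completion callbacks
    private final AtomicBoolean copyFailed;      // set by a copy ProgressListener
    private int batchesIssued;

    BatchDeleter(BlockingQueue<String> queue, AtomicBoolean copyFailed) {
        this.queue = queue;
        this.copyFailed = copyFailed;
    }

    @Override
    public Integer call() throws InterruptedException {
        List<String> batch = new ArrayList<>();
        while (!copyFailed.get()) {
            String key = queue.take();
            if (key.equals(POISON)) {
                break;
            }
            batch.add(key);
            if (batch.size() == MAX_ENTRIES_TO_DELETE) {
                batchesIssued++;  // real code: issue one multi-object DELETE here
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            batchesIssued++;      // flush the final partial batch
        }
        return batchesIssued;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        AtomicBoolean copyFailed = new AtomicBoolean(false);
        ExecutorService es = Executors.newSingleThreadExecutor();
        Future<Integer> deleteFuture = es.submit(new BatchDeleter(queue, copyFailed));
        for (int i = 0; i < 7; i++) {
            queue.put("key-" + i);  // enqueued by copy callbacks in the real code
        }
        queue.put(POISON);          // all copies done
        System.out.println("batches issued: " + deleteFuture.get());
        es.shutdown();
    }
}
```

With seven keys and a batch size of three, the consumer issues two full batches and one flush, so three DELETE requests in place of seven.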




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15702191#comment-15702191
 ] 

Steve Loughran commented on HADOOP-13600:
-

@stakiar - could you take a look at HADOOP-13823 and see what you think? That's 
where I've got explicit exceptions being raised on rename failures, rather than 
catch + log + return 0.




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15702187#comment-15702187
 ] 

Steve Loughran commented on HADOOP-13600:
-

# S3A instrumentation should add a gauge of pending copy transfers; it'd be 
incremented during queue submit, decremented on success/failure of the copy. If 
there was a separate "active copy" gauge we could even distinguish 
copy-in-progress from copy-waiting-for-a-thread.
# We could also actually include the size of the file being copied, as it will 
come from the list/getFileStatus calls. I think I'd like to see more debug 
level logging too; maybe something in innerRename() to actually log the entire 
duration and effective bandwidth of the call. I'd certainly like to know that.
# the package scoped inner rename mentioned above could also benefit from 
knowing file count, total size of the rename. It may want to log that at INFO, 
irrespective of what S3A does. Why? It answers the support-call question "why 
does the committer take so long at the end?"
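The proposed gauges could be as simple as a pair of atomics wired into the copy lifecycle; a hedged sketch (field and method names invented here, not S3A's actual instrumentation API):

```java
import java.util.concurrent.atomic.AtomicLong;

public class CopyGauges {

    // Gauge of copies submitted but not yet finished.
    final AtomicLong pendingCopies = new AtomicLong();
    // Gauge of copies actively transferring, to distinguish
    // copy-in-progress from copy-waiting-for-a-thread.
    final AtomicLong activeCopies = new AtomicLong();

    // Incremented during queue submit.
    void copySubmitted() {
        pendingCopies.incrementAndGet();
    }

    // Incremented when a worker thread actually starts the COPY.
    void copyStarted() {
        activeCopies.incrementAndGet();
    }

    // Decremented on success or failure of the copy.
    void copyFinished() {
        activeCopies.decrementAndGet();
        pendingCopies.decrementAndGet();
    }
}
```

At any instant, {{pendingCopies - activeCopies}} is the number of copies queued but still waiting for a thread.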




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15702172#comment-15702172
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/157#discussion_r89797606
  
--- Diff: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
 ---
@@ -2288,4 +2316,35 @@ public String toString() {
 }
   }
 
+  /**
+   * A {@link ProgressListener} for renames. When the transfer completes, 
the listener will delete the source key and
+   * update any relevant statistics.
+   */
+  private class RenameProgressListener implements ProgressListener {
+
+private final S3AFileStatus srcStatus;
+
+private RenameProgressListener(S3AFileStatus srcStatus) {
+  this.srcStatus = srcStatus;
+}
+
+@Override
+public void progressChanged(ProgressEvent progressEvent) {
+  switch (progressEvent.getEventType()) {
+case TRANSFER_PART_COMPLETED_EVENT:
+  incrementWriteOperations();
+  break;
+case TRANSFER_COMPLETED_EVENT:
+  try {
+innerDelete(srcStatus, false);
--- End diff --

if we add some threads for low priority delete/create calls, then this can 
be done in the background. It'd also change how failures are handled: delete 
won't fail; no need to catch IOEs or AmazonClientExceptions. However, I'd pool 
the delete operations to avoid throttling problems





[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15693551#comment-15693551
 ] 

Steve Loughran commented on HADOOP-13600:
-

I've just stuck up my Nov 10 patch as a PR too. Mine is WiP; I'd stopped as I'd 
got to a point where it was too complex. Now that we know of the thread pool 
problems in the TransferManager, we'll have to look at both.

What I have pulled out from mine, which I want to get in first, is having 
{{innerRename()}} raise meaningful exceptions, rather than just return 
true/false with no details whatsoever. That's part of HADOOP-13823.

Why? So that when I do an S3A-specific committer, it can stop having to deal 
with the ambiguity of rename returning false, and instead fail with meaningful 
messages. When I start that I'd make {{innerRename()}} package public, or add 
some new @Private scoped void renameStrict() call *which would also take some 
progressable callback*.

Looking at your code, I like how you rely on AWS callbacks to trigger deletes. 
However, it's nice to be able to pool those to avoid throttling from too many 
requests; that could be done by (optionally) building up the list and only 
triggering a delete when a threshold was reached.

What your patch doesn't have (and I'd planned to but not done myself) is async 
listing. If we retain the current "process 5000, list and repeat" strategy, then 
the list creates a bottleneck, as may the waiting for all entries in a single 
batch to complete. (I'm not sure how often that situation arises; it would if 
there was, say, a 4GB file and lots of other small files in the tree: you could 
block on that 4GB file even while there is more to copy.)
That we could maybe ignore. Other issues:

# Failures. If one copy fails, we'd want to not submit any more, even while 
ongoing work is still completed.
# Queue saturation. Unless there's a separate rename thread pool (my strategy), 
we should have some blocking queue of pending copies. Why? It stops all other IO 
blocking just because one thread submitted 5000 copy operations.
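The second point can be sketched as a semaphore-guarded wrapper around a shared executor, so one rename can only ever hold a bounded number of pool slots. `BoundedSubmitter`, `demo()`, and the in-flight limit are illustrative for this sketch, not S3A code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Wraps a shared executor so one caller can only have a fixed number of tasks in flight. */
public class BoundedSubmitter {
  private final ExecutorService pool;
  private final Semaphore permits;

  public BoundedSubmitter(ExecutorService pool, int maxInFlight) {
    this.pool = pool;
    this.permits = new Semaphore(maxInFlight);
  }

  /** Blocks the submitting thread until a permit is free, then submits. */
  public <T> Future<T> submit(Callable<T> task) throws InterruptedException {
    permits.acquire();
    return pool.submit(() -> {
      try {
        return task.call();
      } finally {
        permits.release();  // free a slot for the next pending copy
      }
    });
  }

  /** Demo: push five tiny tasks through a two-wide window on a shared pool. */
  public static int demo() throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    BoundedSubmitter submitter = new BoundedSubmitter(pool, 2);
    AtomicInteger done = new AtomicInteger();
    List<Future<Integer>> results = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      results.add(submitter.submit(done::incrementAndGet));
    }
    for (Future<Integer> f : results) {
      f.get();
    }
    pool.shutdown();
    return done.get();
  }
}
```

The rename thread itself blocks on {{acquire()}}, so other filesystem clients of the shared pool are never starved by a 5000-entry burst.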




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15693525#comment-15693525
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

GitHub user steveloughran opened a pull request:

https://github.com/apache/hadoop/pull/167

HADOOP-13600 

Starting on parallel rename; still designing code for max parallelism. Even 
listing and delete calls should be in parallel threads. We really only need to 
be collecting keys at the same rate as copies, which is implicitly defined by 
the rate of keys added to a delete queue.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/steveloughran/hadoop s3/HADOOOP-13600-rename

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hadoop/pull/167.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #167


commit 00a0b79481cced4def8734f1aadfb94ef315d737
Author: Steve Loughran 
Date:   2016-11-10T10:26:34Z

HADOOP-13600 starting on parallel rename, still designing code for max 
parallelism. Even listing and delete calls should be in parallel threads. 
Indeed: listing could consider doing a pre-emptive call to grab all of the 
list, though for a bucket with a few million files this would be too expensive. 
Really only need to be collecting at the same rate as copies, which is 
implicitly defined by the rate of keys added to a delete queue

Change-Id: I906a1a15f3a7567cbff1999236549627859319a5







[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-10 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654993#comment-15654993
 ] 

Sahil Takiar commented on HADOOP-13600:
---

[~ste...@apache.org] I created a Pull Request: 
https://github.com/apache/hadoop/pull/157

Let me know what you think of my approach. I verified that the S3 unit tests 
pass, but have not run the integration tests yet.

The patch is pretty simple, but it's different from the approach you outlined in 
HIVE-15093. Below are some notes:

* A new method called {{copyFileAsync}} was added which returns a {{Copy}} 
object, the original method {{copyFile}} is still there but it just invokes 
{{copyFileAsync(...).waitForCopyResult()}}
* Deletes are done inside the {{ProgressListener}}, I removed the logic in 
{{rename(...)}} that issues bulk delete requests
** I'm assuming the {{ProgressListener}} is invoked by the same thread that is 
issuing the copy request (correct me if I am wrong)
** The drawback is that more calls to S3 are made since delete ops aren't 
grouped together, but the advantage is that deletes are now done across 
multiple threads
*** Let me know if you think this scales. Another benefit of my approach is 
that the logic is much simpler. If we need bulk delete ops then some type of 
intermediate blocking queue may be necessary
* I'm not entirely sure how to make the listing anything other than sequential; 
the API seems to suggest you have to call {{listNextBatchOfObjects(...)}} one 
batch at a time
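The listener-driven delete described above can be sketched with simplified stand-ins for the SDK's progress-callback types; the real patch uses {{com.amazonaws.event.ProgressListener}} and calls {{innerDelete()}}, while `RenameListenerSketch` and its nested types are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-ins for the AWS SDK's progress-callback types, showing the
 *  delete-in-listener flow of the patch under discussion. */
public class RenameListenerSketch {

  enum EventType { TRANSFER_PART_COMPLETED_EVENT, TRANSFER_COMPLETED_EVENT }

  interface ProgressListener {
    void progressChanged(EventType event);
  }

  /** Keys deleted so far; stands in for the actual delete calls to S3. */
  final List<String> deleted = new ArrayList<>();

  /** Returns a listener that deletes the source key only once its copy has
   *  fully completed; part-completed events just pass through. */
  ProgressListener forSource(String srcKey) {
    return event -> {
      if (event == EventType.TRANSFER_COMPLETED_EVENT) {
        deleted.add(srcKey);  // stands in for innerDelete(srcStatus, false)
      }
    };
  }
}
```

Because each copy's listener fires on its own transfer, the deletes run across multiple threads, which is the trade-off noted above against grouping them into bulk requests.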




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-11-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654967#comment-15654967
 ] 

ASF GitHub Bot commented on HADOOP-13600:
-

GitHub user sahilTakiar opened a pull request:

https://github.com/apache/hadoop/pull/157

HADOOP-13600. S3a rename() to copy files in a directory in parallel



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sahilTakiar/hadoop HADOOP-13600

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hadoop/pull/157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #157


commit 1d659cba642b381fd98b9fb3a2443ee9d2cab2fa
Author: Sahil Takiar 
Date:   2016-11-10T19:27:24Z

HADOOP-13600. S3a rename() to copy files in a directory in parallel







[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-09-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506465#comment-15506465
 ] 

Steve Loughran commented on HADOOP-13600:
-

If we do this using the same thread pool as for block uploads, then some 
priority queuing should be used for the renames, so that they get priority over 
uploads, the latter being much slower.
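One way such priority queuing could look is a {{ThreadPoolExecutor}} backed by a {{PriorityBlockingQueue}}, with renames given a smaller priority value than uploads. This is an illustrative sketch, not the S3A thread pool; note tasks must go through {{execute()}}, since {{submit()}} would wrap them in a non-comparable {{FutureTask}}.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Sketch of a shared pool whose work queue orders tasks by priority, so short
 *  rename copies jump ahead of queued block uploads. Names are illustrative. */
public class PriorityPoolSketch {

  /** A runnable with a priority; lower value runs first. */
  static class PrioritizedTask implements Runnable, Comparable<PrioritizedTask> {
    final int priority;
    final Runnable body;
    PrioritizedTask(int priority, Runnable body) {
      this.priority = priority;
      this.body = body;
    }
    @Override public void run() { body.run(); }
    @Override public int compareTo(PrioritizedTask o) {
      return Integer.compare(priority, o.priority);
    }
  }

  static ThreadPoolExecutor newPool(int threads) {
    return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
        new PriorityBlockingQueue<>());  // drains lowest priority value first
  }

  /** Queue an "upload" and a "rename" behind a blocker on a one-thread pool;
   *  the rename, having the smaller priority value, is dequeued first. */
  static List<String> demoOrder() throws InterruptedException {
    ThreadPoolExecutor pool = newPool(1);
    List<String> order = Collections.synchronizedList(new ArrayList<>());
    CountDownLatch gate = new CountDownLatch(1);
    pool.execute(new PrioritizedTask(0, () -> {
      try { gate.await(); } catch (InterruptedException ignored) { }
    }));
    pool.execute(new PrioritizedTask(5, () -> order.add("upload")));
    pool.execute(new PrioritizedTask(1, () -> order.add("rename")));
    gate.countDown();
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    return order;
  }
}
```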




[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

2016-09-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15490020#comment-15490020
 ] 

Steve Loughran commented on HADOOP-13600:
-

Rather than naively issuing the copy calls in the order the list came back, we 
should sort them by file size, largest first.

Why? Assuming there is thread capacity, it means the largest files would all be 
copied simultaneously; as the smaller ones complete, the next copies could 
start while the biggest copy was still ongoing.

This would be faster than a list-ordered approach whenever the list contained a 
mix of long and short blobs.
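This largest-first ordering is the classic longest-processing-time-first scheduling heuristic. A minimal sketch, with `FileEntry` as a hypothetical stand-in for the object summaries a listing returns:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Orders copy submissions so the biggest objects start first, keeping a large
 *  copy from starting last and extending the rename's total duration. */
public class CopyOrdering {

  /** Stand-in for an S3 object summary: key plus object size in bytes. */
  static class FileEntry {
    final String key;
    final long size;
    FileEntry(String key, long size) {
      this.key = key;
      this.size = size;
    }
  }

  /** Returns a copy of the listing sorted by descending size. */
  static List<FileEntry> largestFirst(List<FileEntry> listing) {
    List<FileEntry> sorted = new ArrayList<>(listing);
    sorted.sort(Comparator.comparingLong((FileEntry f) -> f.size).reversed());
    return sorted;
  }
}
```

Submitting {{largestFirst(listing)}} to the copy pool means the pool back-fills with small copies while the big ones are still in flight.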
