[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-21 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13145:

Resolution: Fixed
Status: Resolved  (was: Patch Available)

+1

applied the branch-2.8 patch, tested, all is well.

> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: HADOOP-13145-branch-2.004.patch, 
> HADOOP-13145-branch-2.8.004.patch, HADOOP-13145.001.patch, 
> HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-20 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13145:
---
Attachment: HADOOP-13145-branch-2.8.004.patch

Steve, thank you for catching the branch-2.8 problem and reverting.  Sorry I 
didn't catch it myself earlier.

I'm attaching a branch-2.8 patch.  {{GenericTestUtils#getTestDir}} was 
introduced in your HADOOP-12984 patch, targeted to 2.9.0.  That's a sizable 
patch, and I don't want to take on a back-port right now.  Instead, this 
branch-2.8 patch goes back to the pre-HADOOP-12984 strategy of individual tests 
reading the {{test.build.data}} property directly.

> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: HADOOP-13145-branch-2.004.patch, 
> HADOOP-13145-branch-2.8.004.patch, HADOOP-13145.001.patch, 
> HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-20 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13145:
---
Status: Patch Available  (was: Reopened)

> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: HADOOP-13145-branch-2.004.patch, 
> HADOOP-13145-branch-2.8.004.patch, HADOOP-13145.001.patch, 
> HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-20 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13145:

   Resolution: Fixed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

+1

latest patch brings test time down to <60s including all JUnit overhead.

Thanks for doing this Chris, especially the tests. They'll be a good bit of 
regression testing in future

> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: HADOOP-13145-branch-2.004.patch, HADOOP-13145.001.patch, 
> HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-20 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13145:
---
Attachment: HADOOP-13145-branch-2.004.patch

I'm attaching patch v004.
* Removed redundant single-file tests and small multi-file tests.
* Introduced {{scale.test.distcp.file.size.kb}} configuration property for 
tuning test file sizes.  The default is 10 MB.
* Set multi-part configuration properties to 8 MB, so with the default 10 MB 
file size, the tests will cover multi-part upload.

With this version of the patch, the S3A test runs in ~55 seconds for me, and 
the WASB test runs in ~65 seconds.  I completed a full parallel-test run 
against S3 buckets in US-west-2.

> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HADOOP-13145-branch-2.004.patch, HADOOP-13145.001.patch, 
> HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-18 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13145:

Status: Patch Available  (was: Open)

> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HADOOP-13145.001.patch, HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-17 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13145:
---
Attachment: HADOOP-13145.003.patch

Patch v003 adds a new abstract contract test suite for DistCp coverage and 
concrete test suite subclasses for S3A and WASB.  I verified the tests are 
passing for both hadoop-aws (including running in parallel mode) and 
hadoop-azure.

I'm going to leave the JIRA issue in Open status instead of Patch Available for 
now.  The v003 patch will potentially hit Jenkins a little hard because of 
touching multiple modules, so I'd like to get another round of code review 
feedback first.


> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HADOOP-13145.001.patch, HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-17 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13145:
---
Status: Open  (was: Patch Available)

> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HADOOP-13145.001.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-13 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13145:
---
Status: Patch Available  (was: Open)

> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HADOOP-13145.001.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.

2016-05-13 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HADOOP-13145:
---
Attachment: HADOOP-13145.001.patch

The attached v001 patch avoids the unnecessary {{getFileStatus}} call.

The effect is particularly pronounced when running DistCp with a destination on 
S3A, where eventual consistency on S3 can cause the {{getFileStatus}} call to 
fail with {{FileNotFoundException}}.  Then, the whole MapReduce task fails, 
retries, and repeats copying all the data.  [~rajesh.balamohan], I know you saw 
this with some recent large copies to S3A.  Would you be interested in trying a 
test with this patch?  So far, I don't have my own repro.  Note that this patch 
is only helpful as long as the DistCp command is not preserving metadata 
attributes, so don't use the {{-p}} option.

Cc [~ste...@apache.org].

> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> ---
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HADOOP-13145.001.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org