[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-13145:
    Resolution: Fixed
        Status: Resolved (was: Patch Available)

+1 applied the branch-2.8 patch, tested, all is well.

> In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
> ---
>
>                 Key: HADOOP-13145
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13145
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>             Fix For: 2.8.0
>
>         Attachments: HADOOP-13145-branch-2.004.patch, HADOOP-13145-branch-2.8.004.patch, HADOOP-13145.001.patch, HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the
> {{FileStatus}} from the destination so that it can compare to the source and
> update metadata if necessary. If the DistCp command was run without the
> option to preserve metadata attributes, then this additional
> {{getFileStatus}} call is wasteful.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HADOOP-13145:
    Attachment: HADOOP-13145-branch-2.8.004.patch

Steve, thank you for catching the branch-2.8 problem and reverting. Sorry I didn't catch it myself earlier.

I'm attaching a branch-2.8 patch. {{GenericTestUtils#getTestDir}} was introduced in your HADOOP-12984 patch, targeted to 2.9.0. That's a sizable patch, and I don't want to take on a back-port right now. Instead, this branch-2.8 patch goes back to the pre-HADOOP-12984 strategy of individual tests reading the {{test.build.data}} property directly.
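
As a rough illustration of the "read the {{test.build.data}} property directly" strategy mentioned above, a test could resolve its working directory along the following lines. This is only a sketch and not the code from the patch: the class name, the {{testDir}} helper, and the fallback directory are assumptions made for the example.

```java
import java.io.File;

// Hypothetical helper; not taken from the branch-2.8 patch.
class TestDirSketch {
  // Property name quoted in the comment above.
  static final String TEST_DATA_PROPERTY = "test.build.data";
  // Assumed fallback used only when the property is unset.
  static final String ASSUMED_DEFAULT_DIR = "target/test/data";

  /** Resolves a per-test subdirectory under the configured test data directory. */
  static File testDir(String subdir) {
    String base = System.getProperty(TEST_DATA_PROPERTY, ASSUMED_DEFAULT_DIR);
    return new File(base, subdir).getAbsoluteFile();
  }

  public static void main(String[] args) {
    // Prints an absolute path ending in the chosen subdirectory.
    System.out.println(testDir("distcp"));
  }
}
```

Reading the system property in each test avoids depending on a shared utility that does not exist on the older branch, at the cost of repeating the lookup.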
[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HADOOP-13145:
    Status: Patch Available (was: Reopened)
[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-13145:
       Resolution: Fixed
    Fix Version/s: 2.8.0
           Status: Resolved (was: Patch Available)

+1 latest patch brings test time down to <60s including all JUnit overhead.

Thanks for doing this Chris, especially the tests. They'll be a good bit of regression testing in future
[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HADOOP-13145:
    Attachment: HADOOP-13145-branch-2.004.patch

I'm attaching patch v004.
* Removed redundant single-file tests and small multi-file tests.
* Introduced {{scale.test.distcp.file.size.kb}} configuration property for tuning test file sizes. The default is 10 MB.
* Set multi-part configuration properties to 8 MB, so with the default 10 MB file size, the tests will cover multi-part upload.

With this version of the patch, the S3A test runs in ~55 seconds for me, and the WASB test runs in ~65 seconds. I completed a full parallel-test run against S3 buckets in US-west-2.
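
The arithmetic behind the third bullet can be checked with a small sketch. It assumes that 10 MB is expressed as 10 * 1024 KB and uses {{java.util.Properties}} as a stand-in for the Hadoop configuration; the class and method names are hypothetical, and only the property name {{scale.test.distcp.file.size.kb}} comes from the comment above.

```java
import java.util.Properties;

// Hypothetical sketch; Properties stands in for the real test configuration.
class FileSizeSketch {
  // Property name quoted in the comment above.
  static final String SIZE_KB_KEY = "scale.test.distcp.file.size.kb";
  // Assumption: the 10 MB default corresponds to 10 * 1024 KB.
  static final int ASSUMED_DEFAULT_SIZE_KB = 10 * 1024;
  // The 8 MB multi-part setting described above, in bytes.
  static final long PART_SIZE_BYTES = 8L * 1024 * 1024;

  /** Reads the test file size in KB and converts it to bytes. */
  static long fileSizeBytes(Properties conf) {
    String raw = conf.getProperty(SIZE_KB_KEY, String.valueOf(ASSUMED_DEFAULT_SIZE_KB));
    return Integer.parseInt(raw) * 1024L;
  }

  /** Number of parts needed to upload a file of the given size (ceiling division). */
  static long partCount(long fileSizeBytes, long partSizeBytes) {
    return (fileSizeBytes + partSizeBytes - 1) / partSizeBytes;
  }

  public static void main(String[] args) {
    long size = fileSizeBytes(new Properties());
    // With the defaults, a 10 MB file spans two 8 MB parts.
    System.out.println(size + " bytes, " + partCount(size, PART_SIZE_BYTES) + " parts");
  }
}
```

Any file size above the part size needs at least two parts, which is why the 10 MB default is enough to exercise the multi-part path while keeping the test fast.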
[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-13145:
    Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HADOOP-13145:
    Attachment: HADOOP-13145.003.patch

Patch v003 adds a new abstract contract test suite for DistCp coverage and concrete test suite subclasses for S3A and WASB. I verified the tests are passing for both hadoop-aws (including running in parallel mode) and hadoop-azure.

I'm going to leave the JIRA issue in Open status instead of Patch Available for now. The v003 patch will potentially hit Jenkins a little hard because of touching multiple modules, so I'd like to get another round of code review feedback first.
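
The abstract-suite-plus-subclasses layout described above follows a common pattern: shared test logic lives in an abstract class, and each store-specific subclass only supplies its binding. The sketch below illustrates that pattern with an in-memory stand-in. Every name in it is hypothetical, and it is not the structure of the actual v003 test classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a file system binding; not a Hadoop interface.
interface StoreSketch {
  void write(String path, String data);
  String read(String path);
}

// In-memory store so the sketch runs without any cloud credentials.
class InMemoryStoreSketch implements StoreSketch {
  private final Map<String, String> files = new HashMap<>();
  public void write(String path, String data) { files.put(path, data); }
  public String read(String path) { return files.get(path); }
}

// Shared contract logic: written once, reused by every store-specific subclass.
abstract class AbstractCopyContractSketch {
  /** Each subclass supplies the store it is bound to. */
  abstract StoreSketch createStore();

  /** Copies one file within the store and checks that the content arrived intact. */
  boolean copyRoundTrips() {
    StoreSketch store = createStore();
    store.write("/src/file1", "payload");
    // Stand-in for running the copy tool against the store.
    store.write("/dst/file1", store.read("/src/file1"));
    return "payload".equals(store.read("/dst/file1"));
  }
}

// A real suite would bind to S3A here; this sketch uses the in-memory store.
class S3aCopyContractSketch extends AbstractCopyContractSketch {
  StoreSketch createStore() { return new InMemoryStoreSketch(); }
}

// Likewise for WASB.
class WasbCopyContractSketch extends AbstractCopyContractSketch {
  StoreSketch createStore() { return new InMemoryStoreSketch(); }
}
```

Adding coverage for another store then only requires one more small subclass rather than a copy of the test logic.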
[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HADOOP-13145:
    Status: Open (was: Patch Available)
[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HADOOP-13145:
    Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-13145) In DistCp, prevent unnecessary getFileStatus call when not preserving metadata.
[ https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HADOOP-13145:
    Attachment: HADOOP-13145.001.patch

The attached v001 patch avoids the unnecessary {{getFileStatus}} call. The effect is particularly pronounced when running DistCp with a destination on S3A, where eventual consistency on S3 can cause the {{getFileStatus}} call to fail with {{FileNotFoundException}}. Then, the whole MapReduce task fails, retries, and repeats copying all the data.

[~rajesh.balamohan], I know you saw this with some recent large copies to S3A. Would you be interested in trying a test with this patch? So far, I don't have my own repro. Note that this patch is only helpful as long as the DistCp command is not preserving metadata attributes, so don't use the {{-p}} option.

Cc [~ste...@apache.org].
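
The idea described above, skipping the destination status lookup when no attributes are being preserved, can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in the v001 patch: the {{Attr}} enum, the {{TargetStore}} interface, and {{finishCopy}} are all hypothetical stand-ins for the real DistCp and {{FileSystem}} types.

```java
import java.util.EnumSet;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins; none of these are real DistCp classes.
class SkipStatusSketch {
  enum Attr { REPLICATION, BLOCKSIZE, USER, GROUP, PERMISSION }

  interface TargetStore {
    // Stand-in for the potentially failing or costly destination lookup.
    String getFileStatus(String path);
  }

  /** Finishes a copy; returns true only if the destination status was fetched. */
  static boolean finishCopy(TargetStore store, String path, EnumSet<Attr> preserve) {
    if (preserve.isEmpty()) {
      // Nothing to compare or update, so the lookup is skipped entirely.
      return false;
    }
    String status = store.getFileStatus(path);
    // A real implementation would compare this status with the source here
    // and update only the attributes named in 'preserve'.
    return status != null;
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    TargetStore store = p -> { calls.incrementAndGet(); return "status:" + p; };

    finishCopy(store, "/dst/a", EnumSet.noneOf(Attr.class));
    System.out.println(calls.get()); // 0: no lookup without preserved attributes

    finishCopy(store, "/dst/a", EnumSet.of(Attr.USER));
    System.out.println(calls.get()); // 1: lookup happens once attributes are preserved
  }
}
```

With an eventually consistent destination, each avoided lookup is also one fewer chance for a spurious {{FileNotFoundException}} to fail the task and force a full re-copy.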