[
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290079#comment-15290079
]
Chris Nauroth commented on HADOOP-13145:
----------------------------------------
Interestingly, you're getting a much slower run than me for S3A and a much
faster run than me for WASB. I'm in the US Pacific Northwest. My S3 bucket is
in us-west-2. My Azure Storage account is in West US.
{code}
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDistCp
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 140.389 sec - in org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDistCp
Running org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 143.99 sec - in org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp
{code}
bq. Could it be made one of the scalable tests, where it takes a config
option on scale so it can be made configurable?
We definitely could do that, but in my test runs, the large file tests don't
show a significantly longer execution time. (See below for my timings.) Are
the large file tests a long haul in your environment?
Maybe a more effective change would be to cut down the number of test cases. I
could keep just {{deepDirectoryStructureToRemote}}, {{largeFilesToRemote}},
{{deepDirectoryStructureFromRemote}} and {{largeFilesFromRemote}}. If I do
that, then my S3A execution time comes down from about 144 seconds to 90
seconds. I don't think it sacrifices much in terms of coverage.
Let me know your thoughts, and then I'll update the patch.
{code}
<testcase name="multipleFilesToRemote"
classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
time="22.084"/>
<testcase name="deepDirectoryStructureFromRemote"
classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
time="12.973"/>
<testcase name="deepDirectoryStructureToRemote"
classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
time="27.658"/>
<testcase name="largeFilesToRemote"
classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
time="26.381"/>
<testcase name="singleFileToRemote"
classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
time="12.197"/>
<testcase name="largeFilesFromRemote"
classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
time="18.894"/>
<testcase name="multipleFilesFromRemote"
classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
time="9.822"/>
<testcase name="singleFileFromRemote"
classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp"
time="6.835"/>
{code}
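As a rough cross-check of the 90-second figure, the per-test S3A timings above can be summed for the four test cases proposed for retention. The following is a minimal sketch, not part of the patch: the class name {{DistCpTimingEstimate}} and its methods are hypothetical, and it assumes that the gap between the summed per-test times and the reported 143.99 sec elapsed time is fixed overhead that would carry over unchanged to the reduced run.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Hadoop or the patch.
public class DistCpTimingEstimate {
    // Per-test times in seconds, copied from the S3A report above.
    static final Map<String, Double> SECONDS = new LinkedHashMap<>();
    static {
        SECONDS.put("multipleFilesToRemote", 22.084);
        SECONDS.put("deepDirectoryStructureFromRemote", 12.973);
        SECONDS.put("deepDirectoryStructureToRemote", 27.658);
        SECONDS.put("largeFilesToRemote", 26.381);
        SECONDS.put("singleFileToRemote", 12.197);
        SECONDS.put("largeFilesFromRemote", 18.894);
        SECONDS.put("multipleFilesFromRemote", 9.822);
        SECONDS.put("singleFileFromRemote", 6.835);
    }

    // The four test cases proposed for retention.
    static final List<String> KEPT = Arrays.asList(
            "deepDirectoryStructureToRemote",
            "largeFilesToRemote",
            "deepDirectoryStructureFromRemote",
            "largeFilesFromRemote");

    // Elapsed time reported for the full S3A run.
    static final double REPORTED_ELAPSED = 143.99;

    static double sumAll() {
        double total = 0.0;
        for (double t : SECONDS.values()) {
            total += t;
        }
        return total;
    }

    static double sumKept() {
        double total = 0.0;
        for (String name : KEPT) {
            total += SECONDS.get(name);
        }
        return total;
    }

    // Assumption: setup/teardown overhead stays the same in the reduced run.
    static double estimateReducedRun() {
        double overhead = REPORTED_ELAPSED - sumAll();
        return sumKept() + overhead;
    }

    public static void main(String[] args) {
        System.out.printf("all eight tests: %.3f s%n", sumAll());
        System.out.printf("four kept tests: %.3f s%n", sumKept());
        System.out.printf("estimated reduced run: %.3f s%n", estimateReducedRun());
    }
}
```

The four retained tests sum to about 85.9 seconds out of about 136.8 seconds for all eight. Under the fixed-overhead assumption the estimate is about 93 seconds, which is in the same range as the roughly 90 seconds measured, allowing for run-to-run variation.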
> In DistCp, prevent unnecessary getFileStatus call when not preserving
> metadata.
> -------------------------------------------------------------------------------
>
> Key: HADOOP-13145
> URL: https://issues.apache.org/jira/browse/HADOOP-13145
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: HADOOP-13145.001.patch, HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the
> {{FileStatus}} from the destination so that it can compare to the source and
> update metadata if necessary. If the DistCp command was run without the
> option to preserve metadata attributes, then this additional
> {{getFileStatus}} call is wasteful.
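The idea in the description above can be illustrated with a small standalone sketch. This is not the actual DistCp code or the attached patch: {{PreserveSketch}}, {{Attribute}}, {{TargetFs}} and {{afterCopy}} are hypothetical stand-ins, and the file status is modelled as a plain string. The sketch only shows the guard: when no attributes are to be preserved, the status lookup on the destination is skipped.

```java
import java.util.EnumSet;

// Hypothetical illustration; does not use or reproduce Hadoop classes.
public class PreserveSketch {
    // Stand-in for the set of metadata attributes a copy may preserve.
    enum Attribute { PERMISSION, USER, GROUP, TIMES }

    // Stand-in for a destination file system with a costly status lookup.
    interface TargetFs {
        String getFileStatus(String path);
    }

    // Counts lookups so the saving is observable.
    static class CountingFs implements TargetFs {
        int statusCalls = 0;

        @Override
        public String getFileStatus(String path) {
            statusCalls++;
            return "status:" + path;
        }
    }

    // Post-copy step: only fetch the destination status if something
    // needs to be compared and preserved.
    static void afterCopy(TargetFs fs, String target, EnumSet<Attribute> preserve) {
        if (preserve.isEmpty()) {
            return; // nothing to preserve, so the remote call is avoided
        }
        String targetStatus = fs.getFileStatus(target);
        // A real implementation would compare targetStatus with the source
        // status here and update the destination metadata if they differ.
    }

    static int callsWithoutPreserve() {
        CountingFs fs = new CountingFs();
        afterCopy(fs, "/dest/file", EnumSet.noneOf(Attribute.class));
        return fs.statusCalls;
    }

    static int callsWithPreserve() {
        CountingFs fs = new CountingFs();
        afterCopy(fs, "/dest/file", EnumSet.of(Attribute.PERMISSION));
        return fs.statusCalls;
    }

    public static void main(String[] args) {
        System.out.println("lookups without preserve: " + callsWithoutPreserve());
        System.out.println("lookups with preserve: " + callsWithPreserve());
    }
}
```

Against an object store such as S3A or WASB, each avoided lookup is one fewer remote request per copied file, which is where the saving would come from.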