[ 
https://issues.apache.org/jira/browse/HADOOP-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290079#comment-15290079
 ] 

Chris Nauroth commented on HADOOP-13145:
----------------------------------------

Interestingly, your S3A run is much slower than mine, and your WASB run is much 
faster.  I'm in the US Pacific Northwest.  My S3 bucket is in us-west-2.  My 
Azure Storage account is in West US.

{code}
Running org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDistCp
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 140.389 sec - in org.apache.hadoop.fs.azure.contract.TestAzureNativeContractDistCp

Running org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 143.99 sec - in org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp
{code}

bq. Could it be made one of the scaleable tests where it takes a config of 
option on scale so can be made configurable?

We definitely could do that, but in my test runs, the large file tests don't 
show a significantly longer execution time.  (See below for my timings.)  Are 
the large file tests a long haul in your environment?

Maybe a more effective change would be to cut down the number of test cases.  I 
could keep just {{deepDirectoryStructureToRemote}}, {{largeFilesToRemote}}, 
{{deepDirectoryStructureFromRemote}} and {{largeFilesFromRemote}}.  If I do 
that, then my S3A execution time comes down to about 90 seconds.  I don't think 
that sacrifices much in terms of coverage.

Let me know your thoughts, and then I'll update the patch.

{code}
  <testcase name="multipleFilesToRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp" time="22.084"/>
  <testcase name="deepDirectoryStructureFromRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp" time="12.973"/>
  <testcase name="deepDirectoryStructureToRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp" time="27.658"/>
  <testcase name="largeFilesToRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp" time="26.381"/>
  <testcase name="singleFileToRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp" time="12.197"/>
  <testcase name="largeFilesFromRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp" time="18.894"/>
  <testcase name="multipleFilesFromRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp" time="9.822"/>
  <testcase name="singleFileFromRemote" classname="org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp" time="6.835"/>
{code}
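
As a rough cross-check of the trimmed-suite estimate, the per-test times in the report above can be summed for the four cases that would be kept. This is only a sketch: the class name {{DistCpTimingEstimate}} and its members are hypothetical and not part of Hadoop, and the numbers are copied from the S3A report above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: sums per-test times from the S3A report.
class DistCpTimingEstimate {
    // Per-test times in seconds, copied from the testcase entries in the report.
    static final Map<String, Double> S3A_TIMES = new LinkedHashMap<>();
    static {
        S3A_TIMES.put("multipleFilesToRemote", 22.084);
        S3A_TIMES.put("deepDirectoryStructureFromRemote", 12.973);
        S3A_TIMES.put("deepDirectoryStructureToRemote", 27.658);
        S3A_TIMES.put("largeFilesToRemote", 26.381);
        S3A_TIMES.put("singleFileToRemote", 12.197);
        S3A_TIMES.put("largeFilesFromRemote", 18.894);
        S3A_TIMES.put("multipleFilesFromRemote", 9.822);
        S3A_TIMES.put("singleFileFromRemote", 6.835);
    }

    // The four test cases proposed to be kept.
    static final List<String> KEPT = Arrays.asList(
        "deepDirectoryStructureToRemote",
        "largeFilesToRemote",
        "deepDirectoryStructureFromRemote",
        "largeFilesFromRemote");

    // Adds up the recorded times for the given test case names.
    static double sum(Iterable<String> names) {
        double total = 0.0;
        for (String name : names) {
            total += S3A_TIMES.get(name);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
            "all cases:  %.3f sec", sum(S3A_TIMES.keySet())));
        System.out.println(String.format(Locale.ROOT,
            "kept cases: %.3f sec", sum(KEPT)));
    }
}
```

The sum covers only the individual test cases. The elapsed time reported for the whole class is a little higher than the sum of all eight cases, presumably because of setup and teardown outside the cases, so a trimmed run would likely land somewhat above the summed figure for the kept cases.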


> In DistCp, prevent unnecessary getFileStatus call when not preserving 
> metadata.
> -------------------------------------------------------------------------------
>
>                 Key: HADOOP-13145
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13145
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HADOOP-13145.001.patch, HADOOP-13145.003.patch
>
>
> After DistCp copies a file, it calls {{getFileStatus}} to get the 
> {{FileStatus}} from the destination so that it can compare to the source and 
> update metadata if necessary.  If the DistCp command was run without the 
> option to preserve metadata attributes, then this additional 
> {{getFileStatus}} call is wasteful.
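
The idea in the description can be illustrated with a minimal, self-contained sketch. It does not use the real DistCp or {{FileSystem}} classes: {{PreserveSketch}}, {{CountingFs}}, {{Attr}} and {{afterCopy}} are hypothetical stand-ins, and the actual patch may be structured differently. The sketch only shows the shape of the change, which is to skip the destination lookup when no attributes were requested for preservation.

```java
import java.util.EnumSet;

// Hypothetical stand-ins; the real DistCp and FileSystem classes are not used here.
class PreserveSketch {
    // Illustrative set of attributes a copy might be asked to preserve.
    enum Attr { REPLICATION, BLOCKSIZE, USER, GROUP, PERMISSION }

    // Minimal stand-in for the destination file system; counts status lookups.
    static class CountingFs {
        int statusCalls = 0;

        String getFileStatus(String path) {
            statusCalls++;
            return "status:" + path;
        }
    }

    // Called after a file has been copied to 'target'.
    static void afterCopy(CountingFs fs, String target, EnumSet<Attr> preserve) {
        if (preserve.isEmpty()) {
            // Nothing to preserve, so the remote lookup is skipped entirely.
            return;
        }
        String status = fs.getFileStatus(target);
        // A real implementation would compare 'status' with the source here
        // and update only the requested attributes.
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        afterCopy(fs, "/dst/file", EnumSet.noneOf(Attr.class));
        System.out.println("lookups without preserve: " + fs.statusCalls);
        afterCopy(fs, "/dst/file", EnumSet.of(Attr.USER, Attr.PERMISSION));
        System.out.println("lookups with preserve: " + fs.statusCalls);
    }
}
```

Against an object store such as S3A or WASB, each status lookup is a remote request, so skipping it per copied file is where the saving would come from.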



