[ 
https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391432#comment-14391432
 ] 

Colin Patrick McCabe commented on HADOOP-11785:
-----------------------------------------------

Thanks, [~3opan].  This looks good in general.

bq. Should I mark this as a bug fix instead of improvement?

I don't see this as a bug because the functionality is correct.  It seems to be 
an improvement.

{code}
-   * Collect the list of 
+   * Collect the list of
-   *     the the source root is a directory, then the source root entry is not 
+   *     the the source root is a directory, then the source root entry is not
-    if (fileStatus.getPath().equals(sourcePathRoot) && 
+    if (fileStatus.getPath().equals(sourcePathRoot) &&
{code}

Can you remove these whitespace changes from the patch?  They are distracting
and make it look like things have changed when in fact they have not.  I think
there are a few other whitespace changes as well.

{{traverseDirectory}}: Maybe we can optimize this even more.  Can we pass in
the sourceFS to this function, rather than calling {{Path#getFileSystem}}?
{{Path#getFileSystem}} requires some synchronization, which might add overhead.
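
As a rough sketch of the idea, using toy stand-in classes rather than the real
Hadoop types: {{ToyFs}} and {{ToyFsCache}} below are hypothetical and only
model the shape of the call (a synchronized lookup that hands back a file
system handle).  The signature of {{traverseDirectory}} here is also an
assumption, not the one in the patch.  The caller resolves the file system
once and passes it down, so the traversal itself never goes back through the
synchronized lookup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TraverseSketch {
    /** Hypothetical stand-in for a file system handle; not Hadoop's FileSystem. */
    static final class ToyFs {
        private final Map<String, List<String>> children = new HashMap<>();
        int listCalls = 0;

        void addDir(String dir, String... entries) {
            children.put(dir, List.of(entries));
        }

        boolean isDirectory(String path) {
            return children.containsKey(path);
        }

        List<String> listStatus(String dir) {
            listCalls++;  // count listings so the cost is visible
            return children.get(dir);
        }
    }

    /** Hypothetical stand-in for a lookup that takes a lock on every call. */
    static final class ToyFsCache {
        private final ToyFs fs;
        int lookups = 0;

        ToyFsCache(ToyFs fs) {
            this.fs = fs;
        }

        synchronized ToyFs get() {
            lookups++;
            return fs;
        }
    }

    /** Walks the tree using the handle passed in; lists each directory once. */
    static List<String> traverseDirectory(ToyFs sourceFS, String root) {
        List<String> listing = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            String dir = pending.pop();
            for (String child : sourceFS.listStatus(dir)) {
                listing.add(child);
                if (sourceFS.isDirectory(child)) {
                    pending.push(child);
                }
            }
        }
        return listing;
    }

    public static void main(String[] args) {
        ToyFs fs = new ToyFs();
        fs.addDir("/src", "/src/a", "/src/f1");
        fs.addDir("/src/a", "/src/a/f2");
        ToyFsCache cache = new ToyFsCache(fs);

        ToyFs sourceFS = cache.get();  // resolved once by the caller
        List<String> listing = traverseDirectory(sourceFS, "/src");
        System.out.println(listing + " lookups=" + cache.lookups
            + " listCalls=" + fs.listCalls);
    }
}
```

In the toy version the lookup count stays at one no matter how many
directories are visited, which is the effect I would hope for from passing
sourceFS in.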

It looks good aside from that.  Thanks.

> Reduce number of listStatus operation in distcp buildListing()
> --------------------------------------------------------------
>
>                 Key: HADOOP-11785
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11785
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Minor
>         Attachments: distcp-liststatus.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source
> trees (I was using a source of 1.5M files in a tree of about 50K directories).
> For input on S3, buildListing was taking more than one hour. I've noticed a
> performance bug in the current code, which does listStatus twice for each
> directory; this doubles the number of RPCs in some cases (if most directories
> do not contain >1000 files).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
