[ https://issues.apache.org/jira/browse/HADOOP-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289795#comment-17289795 ]

Ayush Saxena commented on HADOOP-17531:
---------------------------------------

Planning to proceed with a PR with the proposed solution in a day or two. The 
new behaviour would be enabled by means of a config; if the config isn't set, 
the present flow will not be affected, so S3 won't be impacted.

For HDFS or other filesystems, where listing isn't an issue but memory is, this 
can be enabled via that property. My present use cases are HDFS to HDFS and 
HDFS to S3; I will keep a follow-up Jira open to sort out the S3 side after 
that...
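
For illustration, a rough sketch of the gating (the config key 
"distcp.listing.use.stack" and the writeToListing sink are placeholders here, 
not final names):

{code:java}
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

// Sketch only: "distcp.listing.use.stack" is a placeholder key, and
// writeToListing(...) stands in for the real copy-listing writer.
class ListingSketch {
  void buildListing(Configuration conf, FileSystem fs, FileStatus root)
      throws IOException {
    boolean useStack = conf.getBoolean("distcp.listing.use.stack", false);
    Deque<FileStatus> pending = new ArrayDeque<>();
    pending.addLast(root);
    while (!pending.isEmpty()) {
      // Flag off: FIFO poll, i.e. the present breadth-first queue.
      // Flag on: LIFO poll, i.e. depth-first; the deque then stays bounded
      // by roughly depth * fan-out rather than the width of the widest level.
      FileStatus cur = useStack ? pending.pollLast() : pending.pollFirst();
      for (FileStatus child : fs.listStatus(cur.getPath())) {
        if (child.isDirectory()) {
          pending.addLast(child);
        }
        writeToListing(child);
      }
    }
  }

  private void writeToListing(FileStatus status) {
    // Placeholder: the real code writes the entry to the copy listing file.
  }
}
{code}

Keeping a single Deque and only switching the poll direction would keep the 
default (flag unset) path identical to today's queue behaviour.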

[[email protected]]/[~rajesh.balamohan] let me know if you folks have any 
objections.

> DistCp: Reduce memory usage on copying huge directories
> -------------------------------------------------------
>
>                 Key: HADOOP-17531
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17531
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Ayush Saxena
>            Priority: Critical
>         Attachments: MoveToStackIterator.patch, gc-NewD-512M-3.8ML.log
>
>
> Presently, DistCp uses a producer-consumer style setup while building the 
> listing; the input queue and output queue are both unbounded, so the set of 
> pending listStatus results grows quite huge.
> Relevant code:
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java#L635
> The traversal is breadth-first (it uses a queue instead of the earlier 
> stack), so if the files sit deep in the tree, it will open up nearly the 
> entire tree before it starts processing them...
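
To put rough numbers on the difference (illustrative figures, not taken from 
the attached GC log): for a tree with fan-out F and depth D, the breadth-first 
queue can hold on the order of F^D pending entries at the deepest level, while 
a depth-first stack holds about F * D. With F = 100 and D = 3, that is roughly 
1,000,000 entries versus 300.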


