[ https://issues.apache.org/jira/browse/HADOOP-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285471#comment-17285471 ]

Ayush Saxena commented on HADOOP-17531:
---------------------------------------

Thanx [[email protected]] I too initially thought of moving to 
{{listStatusIterator}}; pre-HADOOP-11827 that might have been doable, but in the 
present code the producer thread doesn't do the processing, it only prepares the 
listing and puts it in the output queue. So even if I use {{listStatusIterator}}, 
I still need the entire listing, and that won't help me (see the sketch below).
 Even if we move the processing into the producer thread as well, only the files 
would drop out of the queues; the directories would still keep expanding.
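A minimal sketch of why the iterator alone doesn't help (the queue and class 
names here are illustrative, not the actual SimpleCopyListing fields):

{code:java}
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ProducerSketch {
  // Stand-in for the producer's unbounded output queue.
  private final Queue<FileStatus> outputQueue = new ConcurrentLinkedQueue<>();

  // Even with the incremental listStatusIterator, the producer drains the whole
  // iterator into the output queue before the consumer takes anything, so the
  // full listing of the directory still sits in memory.
  void produce(FileSystem fs, Path dir) throws IOException {
    RemoteIterator<FileStatus> it = fs.listStatusIterator(dir);
    while (it.hasNext()) {
      outputQueue.add(it.next());
    }
  }
}
{code}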

For example, the present processing is like this:

say the source dir is /database1

it has, say, 10K tables, each table with 2K partitions

In the first pass:
 InputQueue -> /database1
 gets processed
 OutputQueue -> listStatus of /database1, i.e. the 10K tables
 InputQueue -> empty
 Now the output queue is processed and all the tables are put into the input queue:
 InputQueue -> 10K tables
 OutputQueue -> empty
 By the time I land up processing all the tables, i.e. when the last table's 
listStatus is drained from the output queue:
 InputQueue -> 10K * 2K = 20M partitions. (Already far too much here, right at OOM.)
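To put a number on it, a quick counting sketch of the breadth-first queue growth 
for this example (pure illustration with the numbers above, not DistCp code):

{code:java}
public class BfsQueueGrowth {
  public static void main(String[] args) {
    final long tables = 10_000;
    final long partitionsPerTable = 2_000;

    // Level 1: listing /database1 enqueues every table.
    long queued = tables;
    long peak = queued;

    // Level 2: each table is dequeued and its partitions are enqueued, but
    // nothing at the partition level is consumed until every table is listed.
    for (long t = 0; t < tables; t++) {
      queued = queued - 1 + partitionsPerTable;
      peak = Math.max(peak, queued);
    }
    System.out.println("peak queued entries = " + peak);  // 20,000,000
  }
}
{code}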

The iterator stuff should only help once we get past the previous step, and even 
then we may need to change the present way of working so that the producer 
consumes as well.

This is going in a breadth-first-traversal kind of direction, which I feel isn't 
memory efficient. What I have thought of as of now is to change this to a 
depth-first traversal.

Say we are back in this state:
 InputQueue -> 10K tables
 OutputQueue -> empty

We process one table and then, instead of moving on to the other tables, we 
process the partitions of that table first, as if the data structure were a 
stack rather than a queue; prior to HADOOP-11827 it was in fact a stack. A rough 
sketch of the idea is below.
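A minimal sketch of the depth-first idea, assuming a plain LIFO stack of pending 
directories (an illustration, not the actual SimpleCopyListing change):

{code:java}
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class DepthFirstListing {
  static void traverse(FileSystem fs, Path root) throws IOException {
    Deque<Path> pending = new ArrayDeque<>();
    pending.push(root);
    while (!pending.isEmpty()) {
      Path dir = pending.pop();                 // LIFO: descend before siblings
      RemoteIterator<FileStatus> it = fs.listStatusIterator(dir);
      while (it.hasNext()) {
        FileStatus status = it.next();
        if (status.isDirectory()) {
          pending.push(status.getPath());       // this table's partitions go next
        } else {
          // emit the file entry to the copy listing here
        }
      }
    }
  }
}
{code}

In the example above, the pending stack then peaks at roughly one level of 
tables plus one table's partitions (on the order of 12K entries) instead of ~20M.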

Another option is to put a bound on the inputQueue and the outputQueue. If I 
bound both, chances are I would land up in a deadlock; bounding just one of them, 
combined with the DFS approach, might help, not sure... (a sketch of bounding 
only the output queue is below).
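A minimal sketch of bounding only the output queue so the producer gets 
back-pressure instead of deadlocking (queue name and capacity are illustrative):

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.hadoop.fs.FileStatus;

public class BoundedOutputQueue {
  // Bounded output of listed statuses; the input side stays unbounded so the
  // two sides can never both block on a full queue at the same time.
  private final BlockingQueue<FileStatus> outputQueue =
      new ArrayBlockingQueue<>(10_000);

  void emit(FileStatus status) throws InterruptedException {
    outputQueue.put(status);   // blocks when full: producer slows to consumer pace
  }

  FileStatus drain() throws InterruptedException {
    return outputQueue.take(); // consumer frees capacity and wakes the producer
  }
}
{code}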

Does this make sense? Do you see any other solutions to this?

> DistCp: Reduce memory usage on copying huge directories
> -------------------------------------------------------
>
>                 Key: HADOOP-17531
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17531
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Ayush Saxena
>            Priority: Critical
>
> Presently DistCp uses a producer-consumer kind of setup while building the 
> listing; the input queue and output queue are both unbounded, thus the 
> listing grows quite huge.
> Relevant code:
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java#L635
> This follows a breadth-first-traversal kind of approach (uses a queue instead 
> of the earlier stack), so if you have files at a lower depth, it will open up 
> the entire tree and then start processing....



