[
https://issues.apache.org/jira/browse/HADOOP-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285471#comment-17285471
]
Ayush Saxena commented on HADOOP-17531:
---------------------------------------
Thanx [[email protected]]. I too initially thought of moving to
{{listStatusIterator}}; in the pre-HADOOP-11827 code that might have worked, but
in the present code the producer thread doesn't do the processing, it only
prepares the listing and puts it in the output queue. So even if I use
{{listStatusIterator}}, I still need the entire listing, so that won't help me.
Even if we think of doing the processing in the producer thread as well, the
entries that move out will be the files; the directories will still keep
expanding.
For example, the present processing goes like this:
Say the source dir is /database1, with say 10K tables, each table having 2K partitions.
In the first pass:
InputQueue -> /database1
will be processed, giving
OutputQueue -> listStatus of (/database1), i.e. the 10K tables
InputQueue -> Empty
Now this output queue is processed, and all the tables are put into the input queue:
InputQueue -> 10K tables
OutputQueue -> Empty
By the time all the tables have been processed, i.e. when the listStatus of the
last table is taken from the output queue:
InputQueue -> 10K * 2K partitions. (Already far too much here, heading for OOM)
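In code form, roughly (just a single-threaded sketch with made-up names, leaving
out the actual producer-consumer threads of {{SimpleCopyListing}}):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the breadth-first listing: every directory found at
// one level is queued before any of its children are expanded, so the queue
// ends up holding a whole level of the tree at once (10K tables, then
// 10K * 2K partitions).
public class BreadthFirstListingSketch {
  public static void list(FileSystem fs, Path root) throws IOException {
    Queue<Path> inputQueue = new ArrayDeque<>();   // unbounded, like the current setup
    inputQueue.add(root);
    while (!inputQueue.isEmpty()) {
      Path dir = inputQueue.poll();
      FileStatus[] children = fs.listStatus(dir);  // full listing of one directory
      for (FileStatus child : children) {
        if (child.isDirectory()) {
          inputQueue.add(child.getPath());         // queue grows to the width of a level
        } else {
          // a real implementation would write the file entry to the copy listing here
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path root = new Path(args[0]);
    list(root.getFileSystem(conf), root);
  }
}
{code}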
The iterator stuff would only help once we get past the previous step, and even
then we may need to change the present way of working so that the producer also
consumes.
This is essentially a breadth-first traversal, which I feel isn't memory
efficient. What I have thought of as of now is to change this to a depth-first
traversal.
Say we are in this state:
InputQueue -> 10K tables
OutputQueue -> Empty
We process one table, and then instead of moving on to the other tables we
process the partitions of that table first, as if the data structure were a
stack rather than a queue. Prior to HADOOP-11827 it was indeed a stack.
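Roughly what I mean, again as a single-threaded sketch with made-up names (not a
patch against {{SimpleCopyListing}}), using a stack plus {{listStatusIterator}}:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical depth-first sketch: a newly discovered directory is pushed and
// expanded before its siblings, so memory stays roughly proportional to tree
// depth times per-directory fan-out (about 10K + 2K entries in the example)
// instead of the width of a whole level (10K * 2K).
public class DepthFirstListingSketch {
  public static void list(FileSystem fs, Path root) throws IOException {
    Deque<Path> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Path dir = stack.pop();
      RemoteIterator<FileStatus> it = fs.listStatusIterator(dir); // incremental listing
      while (it.hasNext()) {
        FileStatus child = it.next();
        if (child.isDirectory()) {
          stack.push(child.getPath()); // partitions of this table get expanded next
        } else {
          // write the file entry to the copy listing here
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path root = new Path(args[0]);
    list(root.getFileSystem(conf), root);
  }
}
{code}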
Another option is to put a bound on the inputQueue and the outputQueue. If I
bound both, chances are I would end up in a deadlock; bounding only one of them
together with the DFS approach might help, not sure...
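Roughly what bounding only the input queue would look like (purely a
hypothetical sketch, the names and the bound are made up):

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: put() blocks the thread that discovers new directories
// once the bound is hit, giving back-pressure instead of unbounded growth.
// If the output queue were bounded too, one side could block on putting a
// result while the other blocks on putting new work, which is the deadlock
// risk mentioned above.
public class BoundedQueueSketch {
  private static final int INPUT_QUEUE_BOUND = 1000; // illustrative, not a tuned value
  private final BlockingQueue<String> inputQueue =
      new ArrayBlockingQueue<>(INPUT_QUEUE_BOUND);

  public void submitDirectory(String dir) throws InterruptedException {
    inputQueue.put(dir);   // blocks when full instead of growing without bound
  }

  public String takeDirectory() throws InterruptedException {
    return inputQueue.take();
  }
}
{code}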
Does this make sense? Do you see any other solutions for this?
> DistCp: Reduce memory usage on copying huge directories
> -------------------------------------------------------
>
> Key: HADOOP-17531
> URL: https://issues.apache.org/jira/browse/HADOOP-17531
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Ayush Saxena
> Priority: Critical
>
> Presently distCp uses a producer-consumer kind of setup while building the
> listing; the input queue and output queue are both unbounded, thus the
> accumulated listStatus results grow quite huge.
> Relevant code part:
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java#L635
> This follows a breadth-first traversal kind of approach (it uses a queue
> instead of the earlier stack), so if you have files at a lower depth, it will
> open up the entire tree and only then start processing....