[ 
https://issues.apache.org/jira/browse/HADOOP-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309902#comment-17309902
 ] 

Ayush Saxena commented on HADOOP-17558:
---------------------------------------

Attached a draft patch, just to show the direction I am pursuing and to give a 
basic code-level idea. (I do not plan to remove the present code or break 
compatibility.)

The key points are:
 * Have a fixed queue size, unlike the present queue, which keeps expanding. In 
trunk the producer is multi-threaded and the consumer is single-threaded, so a 
fixed queue size prevents the queue from growing too large.
 * Use {{CallerRunsPolicy}} for automatic throttling.
 * The threads not only list but also consume the files: if an entry is a 
directory we store it for the response; if it is a file, we process it and get 
rid of the burden of its FileStatus. Unlike the present producer-consumer 
setup, consumption of files is now at least multi-threaded.
 * Use {{listStatusIterator}} instead of {{listStatus}}. Since we are also 
consuming, this should help reduce the memory pressure.
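
A minimal sketch of the first two points, using only JDK classes. It is not 
taken from the draft patch: the class name {{BoundedListingPool}}, the method 
names and the sizes are hypothetical. It pairs a fixed number of threads with 
a bounded queue, so that once the queue is full {{CallerRunsPolicy}} makes the 
submitting thread run the task itself, which slows submission down.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration; not code from the draft patch.
final class BoundedListingPool {

  // Fixed thread count and fixed queue capacity. When the queue is full,
  // the submitting thread runs the task itself instead of queueing it.
  static ThreadPoolExecutor create(int threads, int queueCapacity) {
    return new ThreadPoolExecutor(
        threads, threads, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<Runnable>(queueCapacity),
        new ThreadPoolExecutor.CallerRunsPolicy());
  }

  // Submits `tasks` trivial jobs and returns
  // {completed, ranOnCallerThread, largestQueueSizeObserved}.
  static int[] submitBurst(int threads, int queueCapacity, int tasks) {
    ThreadPoolExecutor pool = create(threads, queueCapacity);
    final Thread caller = Thread.currentThread();
    AtomicInteger completed = new AtomicInteger();
    AtomicInteger ranOnCaller = new AtomicInteger();
    int maxQueue = 0;
    for (int i = 0; i < tasks; i++) {
      pool.execute(() -> {
        if (Thread.currentThread() == caller) {
          ranOnCaller.incrementAndGet(); // queue was full: throttled
        }
        completed.incrementAndGet();
      });
      maxQueue = Math.max(maxQueue, pool.getQueue().size());
    }
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return new int[] {completed.get(), ranOnCaller.get(), maxQueue};
  }
}
```

Calling {{BoundedListingPool.submitBurst(2, 4, 50)}} runs all 50 tasks while 
the queue never holds more than 4 entries. How many tasks end up on the 
caller thread depends on timing.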

TODOs:
 * Since we are adding futures as part of processing a future, I need to find a 
good, clean way to know when everything is done. (As of now I did some dirty 
stuff to see how it goes; Iterator isn't thread-safe. See something like 
{{waitForTPEIdle}} and {{checkFutures}} in my present patch.)
 * Check synchronisation and locking carefully, since we are consuming the 
files and writing to the sequence file in parallel. Maybe having a sequence 
file per thread is an alternative: we merge all of them at the end and get rid 
of this synchronisation problem? Not sure, need to think.
 * Test whether it works in the real world (UTs do pass). See how much 
performance is gained (my -useIterator test on S3 at least completes faster 
than in my previous useiterator mode). Most importantly, test how much it 
helps to save memory. (Nothing done as of now; it is all in theory.)
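
One possible way to handle the first TODO, sketched with JDK classes only: keep 
a count of pending tasks that is incremented before each submission and 
decremented when the task ends, so it can reach zero only after every task, 
including those spawned by other tasks, has finished. This is not what the 
draft patch does ({{waitForTPEIdle}} and {{checkFutures}} are not shown here). 
The class {{PendingCountWalk}}, its methods and the in-memory directory tree 
standing in for a FileSystem are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; an in-memory map stands in for the FileSystem.
final class PendingCountWalk {
  private final ThreadPoolExecutor pool;
  private final Map<String, List<String>> tree; // directory -> child paths
  private final AtomicInteger pending = new AtomicInteger();
  private final AtomicInteger filesProcessed = new AtomicInteger();
  private final CountDownLatch done = new CountDownLatch(1);

  PendingCountWalk(Map<String, List<String>> tree, int threads, int queueCapacity) {
    this.tree = tree;
    this.pool = new ThreadPoolExecutor(
        threads, threads, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<Runnable>(queueCapacity),
        new ThreadPoolExecutor.CallerRunsPolicy());
  }

  // Count the task before handing it over. A child is counted while its
  // parent is still running, so the counter cannot hit zero early.
  private void submitDir(String dir) {
    pending.incrementAndGet();
    pool.execute(() -> {
      try {
        for (String child : tree.get(dir)) {
          if (tree.containsKey(child)) {
            submitDir(child);                 // directory: list it in another task
          } else {
            filesProcessed.incrementAndGet(); // file: "consume" it right here
          }
        }
      } finally {
        if (pending.decrementAndGet() == 0) {
          done.countDown();                   // last task finished
        }
      }
    });
  }

  // `root` must be a directory key in the tree. Returns the number of files seen.
  int walk(String root) {
    submitDir(root);
    try {
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdown();
    }
    return filesProcessed.get();
  }

  // Small example tree: three files spread over four directories.
  static Map<String, List<String>> sampleTree() {
    Map<String, List<String>> t = new HashMap<>();
    t.put("/", Arrays.asList("/a", "/b", "/f1"));
    t.put("/a", Arrays.asList("/a/f2", "/a/c"));
    t.put("/a/c", Collections.singletonList("/a/c/f3"));
    t.put("/b", Collections.<String>emptyList());
    return t;
  }
}
```

The counter avoids polling the executor for idleness, but it does not address 
the second TODO: a real consumer would still need to synchronise writes to the 
shared sequence file, or write one file per thread and merge them afterwards.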

Will try to sort out these things and come back with more updates as I 
progress further.

cc. [~rajesh.balamohan]/ [[email protected]]/ [~weichiu]

> DistCp: Reduce memory usage using a fixed size ThreadPoolExecutor
> -----------------------------------------------------------------
>
>                 Key: HADOOP-17558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17558
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Ayush Saxena
>            Priority: Major
>         Attachments: HADOOP-17558-DRAFT-01.patch
>
>
> For S3 and other object stores, where listing is slow, use a fixed size TPE 
> for building listing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
