[
https://issues.apache.org/jira/browse/HADOOP-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309902#comment-17309902
]
Ayush Saxena commented on HADOOP-17558:
---------------------------------------
Attached a draft patch to show the direction I am pursuing. It is only meant
to convey the basic code-level idea. (I do not plan to remove the present code
or break compatibility.)
The key points are:
* Use a fixed queue size. In trunk the queue keeps expanding, because the
producer is multi-threaded while the consumer is single-threaded, so a fixed
size prevents the queue from growing too large.
* Use CallerRunsPolicy for automatic throttling.
* The threads not only list but also consume the files: a directory is stored
for the response, while a file is processed right away so that we get rid of
the burden of its FileStatus. Unlike the present producer-consumer, consumption
of files is now at least multi-threaded.
* Use ListStatusIterator instead of ListStatus. Since we are also consuming,
this should further help reduce the memory pressure.
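
To illustrate the first two points, here is a minimal standalone sketch (not
taken from the patch) of a fixed-size ThreadPoolExecutor with a bounded queue
and CallerRunsPolicy. The class name BoundedListingPool, the helper names, and
the pool and queue sizes are all hypothetical. When the queue is full, the
submitting thread runs the task itself, which is what throttles submission.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedListingPool {

  // Hypothetical helper: fixed thread count and a bounded work queue.
  static ThreadPoolExecutor newPool(int threads, int queueCapacity) {
    return new ThreadPoolExecutor(
        threads, threads,
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity),
        // On a full queue the caller executes the task instead of enqueueing it.
        new ThreadPoolExecutor.CallerRunsPolicy());
  }

  // Submits taskCount trivial tasks and returns how many actually ran.
  static int runTasks(int taskCount) {
    ThreadPoolExecutor pool = newPool(4, 8); // sizes are illustrative only
    AtomicInteger done = new AtomicInteger();
    for (int i = 0; i < taskCount; i++) {
      pool.execute(done::incrementAndGet);
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
        throw new IllegalStateException("pool did not terminate in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return done.get();
  }

  public static void main(String[] args) {
    System.out.println("tasks run: " + runTasks(1000));
  }
}
```

No task is dropped here: the queue never holds more than 8 entries, and the
overflow is executed by the submitting thread.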
TODOs:
* Since we add futures while processing a future, I need to find a good and
clean way to know when everything is done. (As of now I did some dirty stuff
to see how it goes, something like {{waitForTPEIdle}} and {{checkFutures}} in
my present patch. Note that the Iterator isn't thread safe.)
* A proper check on synchronisation and locks, since we consume the files and
write to the sequence file in parallel. Maybe having one sequence file per
thread is an alternative: we merge all of them at the end and get rid of this
synchronisation problem? Not sure, need to think.
* Test whether it works in the real world (UTs do pass) and see how much
performance gain there is. (My -useIterator test on S3 at least completes
faster than my previous useiterator mode.) The main thing is to test how much
it helps to save memory. (Nothing done as of now; it is all theory for now.)
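
For the first TODO, one possible approach (only a sketch, not what the draft
patch does) is to keep a counter of pending tasks: increment it before each
submission, decrement it when a task finishes, and release a latch when it
reaches zero. The class PendingTaskTracker, the in-memory tree, and the
convention that names ending in "/" are directories are all made up for
illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PendingTaskTracker {

  // Toy stand-in for a file system: directory name -> children.
  static Map<String, List<String>> sampleTree() {
    Map<String, List<String>> tree = new HashMap<>();
    tree.put("root/", Arrays.asList("a.txt", "sub1/", "sub2/"));
    tree.put("sub1/", Arrays.asList("b.txt", "c.txt"));
    tree.put("sub2/", Arrays.asList("deep/"));
    tree.put("deep/", Arrays.asList("d.txt"));
    return tree;
  }

  static int countFiles(Map<String, List<String>> tree, String root) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 2, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(4),
        new ThreadPoolExecutor.CallerRunsPolicy());
    AtomicInteger pending = new AtomicInteger();
    AtomicInteger files = new AtomicInteger();
    CountDownLatch allDone = new CountDownLatch(1);

    submit(pool, tree, root, pending, files, allDone);
    try {
      allDone.await(); // released only when no task is pending
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
    return files.get();
  }

  static void submit(ThreadPoolExecutor pool, Map<String, List<String>> tree,
      String dir, AtomicInteger pending, AtomicInteger files,
      CountDownLatch allDone) {
    // Count the task before handing it off, so the counter cannot hit
    // zero while a child task is still about to be submitted.
    pending.incrementAndGet();
    pool.execute(() -> {
      try {
        for (String child : tree.getOrDefault(dir, Collections.emptyList())) {
          if (child.endsWith("/")) {
            submit(pool, tree, child, pending, files, allDone);
          } else {
            files.incrementAndGet(); // "consume" the file right away
          }
        }
      } finally {
        if (pending.decrementAndGet() == 0) {
          allDone.countDown();
        }
      }
    });
  }

  public static void main(String[] args) {
    System.out.println("files: " + countFiles(sampleTree(), "root/"));
  }
}
```

Because a parent decrements the counter only after it has submitted all of
its children, the counter reaches zero exactly once, when the whole tree has
been processed.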
I will try to sort these things out and come up with more updates as I
progress further.
cc. [~rajesh.balamohan]/ [[email protected]]/ [~weichiu]
> DistCp: Reduce memory usage using a fixed size ThreadPoolExecutor
> -----------------------------------------------------------------
>
> Key: HADOOP-17558
> URL: https://issues.apache.org/jira/browse/HADOOP-17558
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Ayush Saxena
> Priority: Major
> Attachments: HADOOP-17558-DRAFT-01.patch
>
>
> For S3 and other object stores, where listing is slow, use a fixed size TPE
> for building listing