[ https://issues.apache.org/jira/browse/MAPREDUCE-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284459#comment-14284459 ]
Jason Lowe commented on MAPREDUCE-6219:
---------------------------------------
One idea: leverage HADOOP-10987 so that FileInputFormat.listStatus can iterate
over the files in chunks and null out any BlockLocation fields that are not
needed during split computation (e.g., BlockLocation.names). We could also
consider interning the strings that are preserved, since the same host is
likely to appear many times across a large number of block locations.
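
A rough sketch of the two ideas above, in plain Java rather than against the
real Hadoop classes. Everything here is hypothetical and only for illustration:
RawBlock stands in for a full block location record, SlimBlock for the trimmed
copy kept for split computation, and slimInChunks for whatever loop would
consume an iterator-based listing. It is not how FileInputFormat is
implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class SplitMemorySketch {

    /** Hypothetical stand-in for a full block location record from a listing. */
    static final class RawBlock {
        final String[] names;   // "ip:port" entries, assumed unused by split computation
        final String[] hosts;   // host names, needed for split locality
        final long offset;
        final long length;

        RawBlock(String[] names, String[] hosts, long offset, long length) {
            this.names = names;
            this.hosts = hosts;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Hypothetical trimmed record: only what split computation is assumed to need. */
    static final class SlimBlock {
        final String[] hosts;
        final long offset;
        final long length;

        SlimBlock(String[] hosts, long offset, long length) {
            this.hosts = hosts;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Keeps one canonical String instance per distinct host name. */
    static final class HostInterner {
        private final Map<String, String> pool = new HashMap<>();

        String intern(String host) {
            String existing = pool.putIfAbsent(host, host);
            return existing != null ? existing : host;
        }

        int size() {
            return pool.size();
        }
    }

    /** Copies the needed fields, dropping names and sharing host strings. */
    static SlimBlock slim(RawBlock raw, HostInterner interner) {
        String[] hosts = new String[raw.hosts.length];
        for (int i = 0; i < hosts.length; i++) {
            hosts[i] = interner.intern(raw.hosts[i]);
        }
        return new SlimBlock(hosts, raw.offset, raw.length);
    }

    /**
     * Pulls entries from the iterator a chunk at a time, so that at most
     * chunkSize full records are held before being replaced by slim copies.
     */
    static List<SlimBlock> slimInChunks(Iterator<RawBlock> listing, int chunkSize,
                                        HostInterner interner) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        List<SlimBlock> kept = new ArrayList<>();
        List<RawBlock> chunk = new ArrayList<>(chunkSize);
        while (listing.hasNext()) {
            chunk.add(listing.next());
            if (chunk.size() == chunkSize || !listing.hasNext()) {
                for (RawBlock raw : chunk) {
                    kept.add(slim(raw, interner));
                }
                chunk.clear();  // release the full records for this chunk
            }
        }
        return kept;
    }
}
```

The interner uses a local map rather than String.intern() so the pool can be
discarded once the splits are computed. Whether the savings are worthwhile
would depend on how many distinct hosts and block locations the input has.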
> Reduce memory required for FileInputFormat located status optimization
> ----------------------------------------------------------------------
>
> Key: MAPREDUCE-6219
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6219
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.1.1-beta
> Reporter: Jason Lowe
> Priority: Minor
>
> MAPREDUCE-1981 introduced an optimization to drastically reduce the number of
> namenode operations required to compute input splits when processing a
> directory. However, it requires more memory to perform this optimization, as
> it retains the full LocatedFileStatus object for all input files while
> computing the splits. This can lead to odd situations for users where using
> a directory as input can run the job client out of heap space but using
> directory/* as the input spec allows it to run within the original heap space.