[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284459#comment-14284459
 ] 

Jason Lowe commented on MAPREDUCE-6219:
---------------------------------------

One idea: leverage HADOOP-10987 so that FileInputFormat.listStatus can iterate 
over the files in chunks and null out any BlockLocation fields that are not 
needed during split computation (e.g., BlockLocation.names).  We could also 
consider interning the strings that are retained, since the same host is 
likely to appear many times across a large number of block locations.
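
A rough sketch of the trimming and interning idea, for illustration only.  It 
does not use the real Hadoop classes: SlimBlock and LocationTrimmer are 
hypothetical stand-ins, and the assumption that only the host names need to 
survive until split computation is taken from the suggestion above, not 
verified against FileInputFormat.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the per-block data kept by the job client. */
final class SlimBlock {
    String[] hosts;      // host names, assumed to be needed for split locality
    String[] names;      // host:port pairs, assumed unused by split computation
    final long offset;
    final long length;

    SlimBlock(String[] hosts, String[] names, long offset, long length) {
        this.hosts = hosts;
        this.names = names;
        this.offset = offset;
        this.length = length;
    }
}

/** Hypothetical helper that drops unneeded fields and deduplicates host strings. */
final class LocationTrimmer {
    // Private pool used instead of String.intern() so the canonical strings
    // can be garbage collected once split computation is finished.
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the one shared instance for strings equal to s. */
    String canonical(String s) {
        String previous = pool.putIfAbsent(s, s);
        return previous == null ? s : previous;
    }

    /** Nulls the field assumed to be unused and canonicalizes each host name. */
    void trim(SlimBlock block) {
        block.names = null;
        for (int i = 0; i < block.hosts.length; i++) {
            block.hosts[i] = canonical(block.hosts[i]);
        }
    }

    /** Trims blocks one at a time as they arrive from an iterator-based listing. */
    List<SlimBlock> trimAll(Iterator<SlimBlock> blocks) {
        List<SlimBlock> kept = new ArrayList<>();
        while (blocks.hasNext()) {
            SlimBlock block = blocks.next();
            trim(block);
            kept.add(block);
        }
        return kept;
    }

    /** Number of distinct host strings retained so far. */
    int distinctHosts() {
        return pool.size();
    }
}
```

With this shape, the retained memory for host names scales with the number of 
distinct hosts rather than with the number of block locations.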

> Reduce memory required for FileInputFormat located status optimization
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6219
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6219
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.1.1-beta
>            Reporter: Jason Lowe
>            Priority: Minor
>
> MAPREDUCE-1981 introduced an optimization to drastically reduce the number of 
> namenode operations required to compute input splits when processing a 
> directory.  However, the optimization requires more memory, since it retains 
> the full LocatedFileStatus object for every input file while computing the 
> splits.  This can lead to odd situations for users, where using a directory 
> as input runs the job client out of heap space but using directory/* as the 
> input spec allows it to run within the original heap space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
