[jira] [Comment Edited] (MAPREDUCE-7241) FileInputFormat listStatus causes oom when there are lots of files in HDFS

Zhihua Deng (Jira) Wed, 18 Sep 2019 07:11:09 -0700


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932482#comment-16932482
 ]


Zhihua Deng edited comment on MAPREDUCE-7241 at 9/18/19 2:10 PM:
-----------------------------------------------------------------

 When would you do the copy? Before each returned file status is put into the 
ArrayList, strip the unnecessary information by rebuilding the LocatedFileStatus.  
All later operations rely only on the BlockLocation wrapped by the 
LocatedFileStatus, so there is no need to cast or inspect the internal structure 
of the LocatedFileStatus, as the code shows, and more entries can be held with a 
smaller memory footprint.
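
To illustrate the point about copying before the status reaches the list, here is a minimal, self-contained sketch. It does not use Hadoop: Location, HeavyLocation, Status and ShrinkSketch are hypothetical stand-ins (roughly for BlockLocation, an HDFS-specific location that retains a LocatedBlock, and LocatedFileStatus), so treat it as a model of the idea under those assumptions rather than the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for BlockLocation: only the fields split planning needs.
class Location {
    final String[] hosts;
    final long offset;
    final long length;

    Location(String[] hosts, long offset, long length) {
        this.hosts = hosts;
        this.offset = offset;
        this.length = length;
    }

    // Copy constructor: the copy is a plain Location, whatever subclass 'that' is.
    Location(Location that) {
        this(that.hosts, that.offset, that.length);
    }
}

// Hypothetical stand-in for a location that also retains a heavy payload
// (the role LocatedBlock plays in the issue description).
class HeavyLocation extends Location {
    final byte[] payload;

    HeavyLocation(String[] hosts, long offset, long length, byte[] payload) {
        super(hosts, offset, length);
        this.payload = payload;
    }
}

// Hypothetical stand-in for LocatedFileStatus.
class Status {
    final String path;
    final Location[] locations;

    Status(String path, Location[] locations) {
        this.path = path;
        this.locations = locations;
    }
}

class ShrinkSketch {
    // Copy every location into a plain Location so the payload is not retained.
    static Location[] dup(Location[] locations) {
        Location[] copy = new Location[locations.length];
        for (int i = 0; i < locations.length; i++) {
            copy[i] = new Location(locations[i]);
        }
        return copy;
    }

    // Shrink each status before it is added to the result list, so the list
    // never holds a reference to the heavy originals.
    static List<Status> shrinkAll(Iterable<Status> listed) {
        List<Status> result = new ArrayList<>();
        for (Status stat : listed) {
            result.add(new Status(stat.path, dup(stat.locations)));
        }
        return result;
    }
}
```

Because the shrink happens inside the loop, each heavy original becomes unreachable as soon as its copy is stored, provided the caller keeps no other reference to it.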


was (Author: dengzh):
 When would you do the copy? Before each returned file status is put into the 
ArrayList, strip the unnecessary information by rebuilding the LocatedFileStatus.  
All later operations rely only on the BlockLocation wrapped by the 
LocatedFileStatus, so there is no need to cast or inspect the internal structure 
of the LocatedFileStatus, as the code shows. 

> FileInputFormat listStatus causes oom when there are lots of files in HDFS
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7241
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7241
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>    Affects Versions: 2.6.1
>            Reporter: Zhihua Deng
>            Priority: Major
>         Attachments: MAPREDUCE-7241.01.patch, filestatus.png
>
>
> This case sometimes occurs in Hive when a user mistakenly issues a query over 
> all partitions. The file statuses cached while listing can accumulate to 
> over 3 GB. After digging into the memory dump, we found that LocatedBlock 
> accounts for about 50% (sometimes over 60%) of the memory retained by 
> LocatedFileStatus, as shown below:
> !filestatus.png!
> Right now we only extract the block location info from LocatedFileStatus; 
> the datanode infos (types) and block tokens are not used. So there 
> is no need to cache LocatedBlock, and we can do the following:
> BlockLocation[] blockLocations = dup(stat.getBlockLocations());
> LocatedFileStatus shrink = new LocatedFileStatus(stat, blockLocations);
>
> private static BlockLocation[] dup(BlockLocation[] blockLocations) {
>     BlockLocation[] copyLocs = new BlockLocation[blockLocations.length];
>     int i = 0;
>     for (BlockLocation location : blockLocations) {
>         // Copy into a plain BlockLocation so the LocatedBlock is no longer retained.
>         copyLocs[i++] = new BlockLocation(location);
>     }
>     return copyLocs;
> }
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

