[ https://issues.apache.org/jira/browse/MAPREDUCE-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931306#comment-16931306 ]
Steve Loughran commented on MAPREDUCE-7241:
-------------------------------------------

I see: HdfsBlockLocation includes too much information, which causes the OOM problems. From a quick scan of the code in the IDE, I don't see that extra block location info being used outside of tests. Maybe the thing to do here is (somehow) not collect that information in the DFS listLocatedStatus call?

> FileInputFormat listStatus causes oom when there are lots of files
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7241
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7241
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>    Affects Versions: 2.6.1
>            Reporter: Zhihua Deng
>            Priority: Major
>         Attachments: filestatus.png
>
>
> This case sometimes occurs in Hive when a user mistakenly issues a query
> over all partitions. The file statuses cached while listing can accumulate
> to over 3 GB. After digging into the memory dump, LocatedBlock accounts for
> about 50% (sometimes over 60%) of the memory retained by LocatedFileStatus,
> as shown below:
> !filestatus.png!
> Right now we only extract the block location info from LocatedFileStatus;
> the datanode infos (types) and block tokens are not used.
> So there is no need to cache LocatedBlock; we could do something like this:
> BlockLocation[] blockLocations = dedup(stat.getBlockLocations());
> LocatedFileStatus shrink = new LocatedFileStatus(stat, blockLocations);
> private static BlockLocation[] dedup(BlockLocation[] blockLocations) {
>   BlockLocation[] copyLocs = new BlockLocation[blockLocations.length];
>   int i = 0;
>   for (BlockLocation location : blockLocations) {
>     // copy into a plain BlockLocation so the LocatedBlock is not retained
>     copyLocs[i++] = new BlockLocation(location);
>   }
>   return copyLocs;
> }
>

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
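
For illustration only, the idea behind the dedup() proposal quoted above (re-create each element as the plain base type so that whatever extra state the subclass holds is no longer referenced) can be sketched without any Hadoop dependency. Everything in this sketch is a hypothetical stand-in: Location plays the role of BlockLocation, HeavyLocation plays the role of a subclass such as HdfsBlockLocation, and the payload field only models the extra per-block data. It is not the Hadoop code and makes no claim about how the issue was finally resolved.

```java
import java.util.Arrays;

// Stand-alone sketch: none of these classes are the real Hadoop types.
final class ShrinkSketch {

    /** Stand-in for the plain location record (hosts, offset, length). */
    static class Location {
        final String[] hosts;
        final long offset;
        final long length;

        Location(String[] hosts, long offset, long length) {
            this.hosts = hosts;
            this.offset = offset;
            this.length = length;
        }

        /** Copy constructor: copies only the fields of the base type. */
        Location(Location other) {
            this(other.hosts.clone(), other.offset, other.length);
        }
    }

    /** Stand-in for a subclass that also pins a heavy per-block payload. */
    static final class HeavyLocation extends Location {
        final byte[] payload; // models the extra data the listing does not need

        HeavyLocation(String[] hosts, long offset, long length, byte[] payload) {
            super(hosts, offset, length);
            this.payload = payload;
        }
    }

    /** Re-creates every element as the base type; the copies hold no payload. */
    static Location[] shrink(Location[] locations) {
        Location[] copy = new Location[locations.length];
        for (int i = 0; i < locations.length; i++) {
            copy[i] = new Location(locations[i]);
        }
        return copy;
    }

    public static void main(String[] args) {
        Location[] listed = {
            new HeavyLocation(new String[] {"dn1", "dn2"}, 0L, 128L, new byte[1024]),
            new HeavyLocation(new String[] {"dn2", "dn3"}, 128L, 64L, new byte[1024]),
        };
        Location[] slim = shrink(listed);
        for (Location loc : slim) {
            System.out.println(loc.getClass().getSimpleName() + " "
                + Arrays.toString(loc.hosts) + " " + loc.offset + "+" + loc.length);
        }
    }
}
```

Once the original array is dropped, nothing reachable from the copies refers to the payloads, so they become eligible for garbage collection. The cost is one extra short-lived copy per element, which is why avoiding the collection of that data in the first place, as suggested in the comment, may be the cheaper route.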