[ https://issues.apache.org/jira/browse/MAPREDUCE-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932261#comment-16932261 ]

Steve Loughran commented on MAPREDUCE-7241:
-------------------------------------------

When would you do the copy? It can't be in the MR code, as that works with many other stores and must not have knowledge of HDFS internals.

That listLocatedStatus() call only seems to be heavily used in FileInputFormat scans during query planning, so if we tweak its return values, things may work. After all, nobody who has used that API against any other FS can rely on the HDFS-specific LocatedStatus subclass being returned, so they won't be casting and inspecting it.

This is something to raise as an HDFS JIRA, linking to this one.

> FileInputFormat listStatus causes oom when there are lots of files in HDFS
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7241
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7241
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>    Affects Versions: 2.6.1
>            Reporter: Zhihua Deng
>            Priority: Major
>         Attachments: filestatus.png
>
>
> This sometimes happens in Hive when a user mistakenly issues a query over
> all partitions. The file statuses cached while listing can accumulate to
> over 3 GB. A heap dump shows that LocatedBlock accounts for about 50%
> (sometimes over 60%) of the memory retained by LocatedFileStatus, as shown
> below:
> !filestatus.png!
> Right now we only extract the block location info from LocatedFileStatus;
> the datanode infos (types) and block tokens are not used.
> So there is no need to cache the LocatedBlock. Something like this would do:
>
>   BlockLocation[] blockLocations = dup(stat.getBlockLocations());
>   LocatedFileStatus shrink = new LocatedFileStatus(stat, blockLocations);
>
>   // Copy each entry through the base-class constructor so that the
>   // HDFS-specific subclass (and the LocatedBlock it holds) is not retained.
>   private static BlockLocation[] dup(BlockLocation[] blockLocations) {
>     BlockLocation[] copyLocs = new BlockLocation[blockLocations.length];
>     int i = 0;
>     for (BlockLocation location : blockLocations) {
>       copyLocs[i++] = new BlockLocation(location);
>     }
>     return copyLocs;
>   }

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
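
For readers who want to see why the copy in the quoted description would shed memory, here is a minimal, self-contained sketch of the idea. It does not use Hadoop at all: SimpleBlockLocation, HeavyBlockLocation and BlockLocationShrinker are hypothetical stand-ins for BlockLocation, the HDFS-specific subclass that keeps a LocatedBlock, and the proposed dup() helper. It assumes the base class has a copy constructor that copies only its own fields, which is what the proposal relies on.

```java
// Stand-in for the generic block location type: only the fields that
// split planning reads (hosts, offset, length). Not the Hadoop class.
class SimpleBlockLocation {
    private final String[] hosts;
    private final long offset;
    private final long length;

    SimpleBlockLocation(String[] hosts, long offset, long length) {
        this.hosts = hosts.clone();
        this.offset = offset;
        this.length = length;
    }

    // Copy constructor: copies base-class fields only, so anything a
    // subclass holds on top of them is left behind.
    SimpleBlockLocation(SimpleBlockLocation other) {
        this(other.hosts, other.offset, other.length);
    }

    String[] getHosts() { return hosts.clone(); }
    long getOffset() { return offset; }
    long getLength() { return length; }
}

// Stand-in for the store-specific subclass. The byte array plays the role
// of the retained LocatedBlock (datanode info, block token, and so on).
class HeavyBlockLocation extends SimpleBlockLocation {
    private final byte[] payload;

    HeavyBlockLocation(String[] hosts, long offset, long length, byte[] payload) {
        super(hosts, offset, length);
        this.payload = payload;
    }

    int payloadSize() { return payload.length; }
}

// Hypothetical helper mirroring the dup() method from the description.
class BlockLocationShrinker {
    static SimpleBlockLocation[] dup(SimpleBlockLocation[] locations) {
        SimpleBlockLocation[] copy = new SimpleBlockLocation[locations.length];
        for (int i = 0; i < locations.length; i++) {
            // Always construct the base type, never the subclass.
            copy[i] = new SimpleBlockLocation(locations[i]);
        }
        return copy;
    }
}
```

Once the original array is no longer referenced, the copies keep only hosts, offset and length, and the heavy payload becomes eligible for garbage collection. Where such a copy should happen (the filesystem client rather than the MR code) is exactly the open question in the comment above.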