Zhihua Deng created MAPREDUCE-7241:
--------------------------------------

             Summary: FileInputFormat listStatus causes oom when there are lots of files
                 Key: MAPREDUCE-7241
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7241
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: job submission
    Affects Versions: 2.6.1
            Reporter: Zhihua Deng
         Attachments: filestatus.png

This is sometimes seen in Hive when a user mistakenly issues a query over all partitions. The file statuses cached while listing status can accumulate to over 3 GB. A heap dump shows that LocatedBlock accounts for about 50% (sometimes over 60%) of the memory retained by LocatedFileStatus, as shown in the attached filestatus.png.

Currently only the block location info is extracted from LocatedFileStatus; the datanode infos (types) and the block token are not used. So there is no need to cache LocatedBlock, and the cached status can be shrunk like this:

```java
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.LocatedFileStatus;

// For each LocatedFileStatus `stat` returned while listing, keep only plain copies
// of its block locations.
BlockLocation[] blockLocations = dedup(stat.getBlockLocations());
LocatedFileStatus shrink = new LocatedFileStatus(stat, blockLocations);

// Rebuild every element with the BlockLocation copy constructor, so the cached
// status no longer references the original location objects.
private static BlockLocation[] dedup(BlockLocation[] blockLocations) {
  BlockLocation[] copyLocs = new BlockLocation[blockLocations.length];
  int i = 0;
  for (BlockLocation location : blockLocations) {
    copyLocs[i++] = new BlockLocation(location);
  }
  return copyLocs;
}
```
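
To illustrate why the copy in dedup() can release memory, here is a minimal standalone sketch. It assumes the retained LocatedBlock is held by a subclass of BlockLocation, so that copying through the base-class copy constructor drops the reference. The classes `Location` and `HeavyLocation` are hypothetical stand-ins, not the Hadoop classes, and the byte array only plays the role of the LocatedBlock payload.

```java
import java.util.Arrays;

// Hypothetical stand-ins for illustration only; these are NOT the Hadoop classes.
final class BlockLocationCopyDemo {

    // Plays the role of BlockLocation: the small part that split computation reads.
    static class Location {
        final String[] hosts;
        final long offset;
        final long length;

        Location(String[] hosts, long offset, long length) {
            this.hosts = hosts;
            this.offset = offset;
            this.length = length;
        }

        // Copy constructor: copies only the base-class fields.
        Location(Location that) {
            this(that.hosts.clone(), that.offset, that.length);
        }
    }

    // Plays the role of a subclass that also pins a large payload
    // (in the issue, the LocatedBlock with datanode info and block token).
    static final class HeavyLocation extends Location {
        final byte[] payload;

        HeavyLocation(String[] hosts, long offset, long length, byte[] payload) {
            super(hosts, offset, length);
            this.payload = payload;
        }
    }

    // Same shape as dedup() in the issue: rebuild each element as a plain Location.
    static Location[] dedup(Location[] locations) {
        Location[] copies = new Location[locations.length];
        for (int i = 0; i < locations.length; i++) {
            copies[i] = new Location(locations[i]);
        }
        return copies;
    }

    public static void main(String[] args) {
        Location[] heavy = {
            new HeavyLocation(new String[] {"dn1", "dn2"}, 0L, 128L, new byte[1 << 20])
        };
        Location[] light = dedup(heavy);
        System.out.println(light[0] instanceof HeavyLocation); // false
        System.out.println(Arrays.toString(light[0].hosts));   // [dn1, dn2]
    }
}
```

Once the original array is no longer referenced, the payload objects become eligible for garbage collection, while the hosts, offset and length needed later remain available on the copies.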