Zhihua Deng created MAPREDUCE-7241:
--------------------------------------

             Summary: FileInputFormat listStatus causes oom when there are lots of files
                 Key: MAPREDUCE-7241
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7241
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: job submission
    Affects Versions: 2.6.1
            Reporter: Zhihua Deng
         Attachments: filestatus.png

This sometimes happens in Hive when a user mistakenly issues a query over all 
partitions. The file statuses cached while listing can accumulate to over 3 GB. 
In the heap dump, LocatedBlock accounts for about 50% (sometimes over 60%) of 
the memory retained by LocatedFileStatus, as shown in the attached 
filestatus.png.

Currently we only extract the block location info from LocatedFileStatus; the 
datanode infos (types) and the block token are not used. So there is no need to 
cache LocatedBlock, and it can be dropped like this:

```java
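// For each LocatedFileStatus `stat` seen while listing: keep only plain
// BlockLocation copies, so the LocatedBlock is no longer retained.
BlockLocation[] blockLocations = dedup(stat.getBlockLocations());
LocatedFileStatus shrink = new LocatedFileStatus(stat, blockLocations);

// Copies each element through the BlockLocation copy constructor, which
// carries over the location fields only.
private static BlockLocation[] dedup(BlockLocation[] blockLocations) {
    BlockLocation[] copyLocs = new BlockLocation[blockLocations.length];
    int i = 0;
    for (BlockLocation location : blockLocations) {
        copyLocs[i++] = new BlockLocation(location);
    }
    return copyLocs;
}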
```
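
To illustrate why the copy helps without needing Hadoop on the classpath, here is a minimal sketch of the same pattern. The classes `Location` and `HeavyLocation` and the method `shrink` are hypothetical stand-ins, not Hadoop APIs: `Location` plays the role of BlockLocation, and `HeavyLocation` plays the role of a location object that also retains a large payload (as LocatedBlock is retained in the heap dump above). Copying through the base-class copy constructor yields objects that no longer reference the payload.

```java
import java.util.Arrays;

public class ShrinkLocationsSketch {

    // Hypothetical stand-in for BlockLocation: only the fields needed later.
    static class Location {
        final String[] hosts;
        final long offset;
        final long length;

        Location(String[] hosts, long offset, long length) {
            this.hosts = hosts;
            this.offset = offset;
            this.length = length;
        }

        // Copy constructor: copies the base fields and nothing else.
        Location(Location that) {
            this(that.hosts, that.offset, that.length);
        }
    }

    // Hypothetical stand-in for a location that also retains heavy metadata.
    static class HeavyLocation extends Location {
        final byte[] blockMetadata;

        HeavyLocation(String[] hosts, long offset, long length, byte[] blockMetadata) {
            super(hosts, offset, length);
            this.blockMetadata = blockMetadata;
        }
    }

    // Returns base-class copies, so the heavy payload is no longer reachable
    // through the returned array. A null input is passed through unchanged.
    static Location[] shrink(Location[] locations) {
        if (locations == null) {
            return null;
        }
        Location[] copy = new Location[locations.length];
        for (int i = 0; i < locations.length; i++) {
            copy[i] = new Location(locations[i]);
        }
        return copy;
    }

    public static void main(String[] args) {
        Location[] heavy = {
            new HeavyLocation(new String[] {"dn1", "dn2"}, 0L, 128L, new byte[1024]),
            new HeavyLocation(new String[] {"dn2", "dn3"}, 128L, 64L, new byte[1024])
        };
        Location[] light = shrink(heavy);
        for (Location loc : light) {
            System.out.println(loc.getClass().getSimpleName() + " "
                + Arrays.toString(loc.hosts)
                + " offset=" + loc.offset + " length=" + loc.length);
        }
    }
}
```

The saving only materializes once the original array is dropped, which is why the snippet above builds a new LocatedFileStatus from the copied locations instead of keeping the original one in the cache.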



