Prevent Merger fd leak when there are lots of empty segments in memory
-----------------------------------------------------------------------

                 Key: MAPREDUCE-1424
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1424
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: task
            Reporter: Xing Shi


The Merger will open too many files on disk when there are many empty 
segments in shuffle memory.


We process large data sets (e.g. > 100 TB) in one job, and we use our own 
partitioner to partition the map output, so one map's output is shuffled 
wholly to a single reduce. As a result, the other reduces receive lots of 
empty segments.

                whole
map_n  |---------------->  reduce1
       |     empty
       |---------------->  reduce2
       |     empty
       |---------------->  reduce3
       |     empty
       |---------------->  reduce4
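
For illustration only (the actual partitioner is not shown in this issue), a 
hypothetical partitioner in this spirit, written against the old 
org.apache.hadoop.mapred API and assuming each map prefixes its keys with a 
target-reduce id:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical sketch: every key starts with "<targetReduce>:" chosen by
    // the map, so the whole output of one map lands on a single reduce and
    // every other reduce receives an empty segment from that map.
    public class OneReducePerMapPartitioner implements Partitioner<Text, Text> {

      public void configure(JobConf job) {
        // no configuration needed for this sketch
      }

      public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();
        int sep = k.indexOf(':');
        int targetReduce = Integer.parseInt(k.substring(0, sep));
        return targetReduce % numPartitions;
      }
    }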

Because our input data is large, there are lots of maps (about 10^5). 
Typically several thousand maps feed each reduce, which means several 
thousand empty segments per reduce.

For example:
     1000 map outputs (on disk) + 3000 empty segments (in memory)

Then, with io.sort.factor=100:

    In the first merge cycle the merger will merge 10 + 3000 segments [the 
pass factor from getPassFactor is (1000 - 1) % (100 - 1) + 1 = 10, plus the 
3000 empty in-memory segments]. Because there is no real data in memory, the 
remaining 990 on-disk map outputs end up taking the place of the 3000 empty 
in-memory segments, so we open 1000 fds.
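
For reference, a minimal sketch of the first-pass factor logic as I read it 
in org.apache.hadoop.mapred.Merger (not a verbatim copy); with 
io.sort.factor=100 and 1000 on-disk segments it yields 10, which is where the 
10 + 3000 figure above comes from:

    public class PassFactorDemo {

      // The first merge pass takes only a remainder-sized batch so that every
      // later pass can merge exactly 'factor' segments.
      static int getPassFactor(int factor, int passNo, int numSegments) {
        if (passNo > 1 || numSegments <= factor || factor == 1) {
          return factor;
        }
        int mod = (numSegments - 1) % (factor - 1);
        if (mod == 0) {
          return factor;
        }
        return mod + 1;
      }

      public static void main(String[] args) {
        // io.sort.factor = 100, 1000 on-disk map outputs:
        // (1000 - 1) % (100 - 1) + 1 = 10 segments in the first pass,
        // on top of which the 3000 empty in-memory segments are merged too.
        System.out.println(getPassFactor(100, 1, 1000)); // prints 10
      }
    }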

    Once several reduces run on one TaskTracker, we open several thousand fds.


    I think we can remove the empty segments when the segments are first 
collected; moreover, in the shuffle phase we could avoid adding empty 
segments to memory at all.
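
A minimal sketch of that idea (not the actual patch): filter zero-length 
segments out of the list before it reaches the merge. The SizedSegment 
interface below is just a stand-in for Merger.Segment, of which only a 
length accessor is assumed:

    import java.util.ArrayList;
    import java.util.List;

    public class EmptySegmentFilter {

      // Stand-in for org.apache.hadoop.mapred.Merger.Segment; only a length
      // accessor is assumed here.
      public interface SizedSegment {
        long getLength();
      }

      // Keep only segments that actually contain data, so empty in-memory
      // segments never take merge slots away from real on-disk map outputs.
      public static <S extends SizedSegment> List<S> dropEmpty(List<S> segments) {
        List<S> nonEmpty = new ArrayList<S>();
        for (S seg : segments) {
          if (seg.getLength() > 0) {
            nonEmpty.add(seg);
          }
        }
        return nonEmpty;
      }
    }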

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
