[ https://issues.apache.org/jira/browse/MAPREDUCE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857232#action_12857232 ]

luoli commented on MAPREDUCE-1690:
----------------------------------

This is a performance comparison between using the BuddySystem to allocate 
memory and just using the "new" operator to allocate it from the heap.
My test program imitates the situation where a reduce task shuffles 
map-output data in memory: several MapOutputCopier threads (the number comes 
from the mapred.reduce.parallel.copies config option) run in the reduce 
process to copy map output, and each thread requires some memory whenever it 
shuffles data in memory. So the program keeps allocating memory from the heap 
using "new", and then, in a second run, using the BuddySystem, so that we can 
observe the difference in the test program's heap. The max heap size of the 
test program is 300MB.
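
For illustration, here is a minimal sketch of that kind of test harness 
(class names, sizes, and thread counts are my assumptions, not the actual 
test code); running it with -Xmx300m -verbose:gc shows the heap and GC 
behaviour compared in the charts below:

{code:java}
import java.util.Random;

public class AllocBenchmark {
    static final int COPIERS = 5;        // stands in for mapred.reduce.parallel.copies
    static final int ROUNDS  = 10000;    // "map outputs" shuffled per thread
    static final int MAX_BUF = 1 << 20;  // up to 1MB per map output

    public static void main(String[] args) throws InterruptedException {
        Thread[] copiers = new Thread[COPIERS];
        for (int i = 0; i < COPIERS; i++) {
            copiers[i] = new Thread(() -> {
                Random r = new Random();
                for (int j = 0; j < ROUNDS; j++) {
                    // Variant A: allocate a fresh buffer from the heap each time.
                    byte[] buf = new byte[1 + r.nextInt(MAX_BUF)];
                    buf[0] = 1;  // touch the buffer so it is not optimized away
                    // Variant B (hypothetical API): borrow pooled memory instead,
                    // e.g. int off = buddy.allocate(size); ... buddy.free(off);
                }
            });
            copiers[i].start();
        }
        for (Thread t : copiers) t.join();
    }
}
{code}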

Heap memory using "new": !allo_use_new.JPG!

GC activity using "new": !allo_use_new_gc.JPG!

Heap memory using the BuddySystem: !allo_use_buddy.JPG!

GC activity using the BuddySystem: !allo_use_buddy_gc.JPG!

And there is still a performance difference: unlike the "new" operator, which 
requests memory from the heap, allocation in the BuddySystem is just a search 
over its free lists, which is much faster. In my test, buddy allocation was 
about 100 times faster than "new".
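
To show why allocation reduces to a search, here is a minimal buddy-allocator 
sketch (my own illustration, not the class from the patch): it keeps a free 
list per power-of-two block size, finds the smallest free block that fits, 
and splits larger blocks down, putting each right half back on a free list:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative buddy allocator: one pre-allocated pool, a free list of
 *  block offsets per power-of-two order. Freeing (merging buddies back
 *  together) is omitted for brevity. */
public class MiniBuddy {
    private final byte[] pool;                // callers copy data in at the returned offset
    private final Deque<Integer>[] freeLists; // freeLists[o] holds offsets of free 2^o blocks
    private final int minOrder;

    @SuppressWarnings("unchecked")
    public MiniBuddy(int totalOrder, int minOrder) {
        this.pool = new byte[1 << totalOrder];   // e.g. totalOrder=27 -> 128MB
        this.minOrder = minOrder;
        this.freeLists = new Deque[totalOrder + 1];
        for (int i = 0; i <= totalOrder; i++) freeLists[i] = new ArrayDeque<>();
        freeLists[totalOrder].push(0);           // the whole pool is one free block
    }

    /** Returns the offset of a block of at least size bytes, or -1 if none
     *  is free (the caller would wait(), as the shuffle code does today). */
    public synchronized int allocate(int size) {
        int order = minOrder;
        while ((1 << order) < size) order++;
        // Search upward for the smallest free block that fits...
        int o = order;
        while (o < freeLists.length && freeLists[o].isEmpty()) o++;
        if (o >= freeLists.length) return -1;
        int off = freeLists[o].pop();
        // ...then split it down to the requested order, freeing each right half.
        while (o > order) {
            o--;
            freeLists[o].push(off + (1 << o));
        }
        return off;
    }
}
{code}

Allocation touches only these small free lists and never the garbage 
collector, which is consistent with the flat heap and quiet GC in the 
BuddySystem charts above.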

Since this patch was merged into our cluster, the memory usage of the 
TaskTracker has been much more stable, and no TaskTracker has been lost 
because of memory overuse while huge jobs were running.

> Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1690
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1690
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.2, 0.20.3
>            Reporter: luoli
>             Fix For: 0.20.2
>
>         Attachments: allo_use_buddy.JPG, allo_use_buddy_gc.JPG, 
> allo_use_new.JPG, allo_use_new_gc.JPG, mapreduce-1690.v1.patch, 
> mapreduce-1690.v1.patch, mapreduce-1690.v1.patch, mapreduce-1690.v2.patch
>
>
>        When a reduce task launches, it starts several MapOutputCopier 
> threads to download the output of finished maps; each thread is a running 
> MapOutputCopier instance. Every time a thread copies map output from a 
> remote host, it decides whether to shuffle the map-output data in memory 
> or to disk; this depends on the size of the map output and on the 
> configuration of the ShuffleRamManager, which is loaded from the client 
> hadoop-site.xml or the JobConf. Either way, if the reduce task decides to 
> shuffle the map output in memory, the MapOutputCopier connects to the 
> remote map host, reads the map output from the socket, and copies it into 
> an in-memory buffer, and every time that buffer comes from "byte[] 
> shuffleData = new byte[mapOutputLength];". Here is where the problem 
> begins. In our cluster there are some special jobs that process a huge 
> amount of original data, say 110TB, so the reduce tasks shuffle a lot of 
> data, some to disk and some in memory. Even so, a lot of data is shuffled 
> in memory, and each time the MapOutputCopier threads "new" some memory 
> from the reduce heap. For a long-running job over huge data, this easily 
> fills the Reduce Task's heap, drives the reduce task to OOM, and then 
> exhausts the memory of the TaskTracker machine.
>        Here is our solution: change the code logic where the 
> MapOutputCopier threads shuffle map output in memory, using a BuddySystem 
> similar to the Linux kernel's buddy system, which allocates and 
> deallocates memory pages. When the reduce task launches, initialize the 
> BuddySystem with some memory, say 128MB. Every time the reduce wants to 
> shuffle map output in memory, it requests a buffer from the BuddySystem; 
> if the BuddySystem has enough memory, it uses that, and if not, the 
> MapOutputCopier threads wait(), just as they do in the current Hadoop 
> shuffle code. This reduces the Reduce Task's memory usage and greatly 
> relieves the TaskTracker's memory shortage. In our cluster, this 
> BuddySystem made "losing a batch of TaskTrackers because of memory 
> overuse when huge jobs run" disappear, and therefore made the cluster 
> more stable.
