[
https://issues.apache.org/jira/browse/MAPREDUCE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allen Wittenauer resolved MAPREDUCE-1690.
-----------------------------------------
Resolution: Won't Fix
Closing as stale.
> Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
> -----------------------------------------------------------------------------
>
> Key: MAPREDUCE-1690
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1690
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task, tasktracker
> Affects Versions: 0.20.2, 0.20.3
> Reporter: luoli
> Fix For: 0.20.2
>
> Attachments: ASF.LICENSE.NOT.GRANTED--allo_use_buddy.JPG,
> ASF.LICENSE.NOT.GRANTED--allo_use_buddy_gc.JPG,
> ASF.LICENSE.NOT.GRANTED--allo_use_new.JPG,
> ASF.LICENSE.NOT.GRANTED--allo_use_new_gc.JPG,
> ASF.LICENSE.NOT.GRANTED--mapreduce-1690.v1.patch,
> ASF.LICENSE.NOT.GRANTED--mapreduce-1690.v1.patch,
> ASF.LICENSE.NOT.GRANTED--mapreduce-1690.v1.patch,
> ASF.LICENSE.NOT.GRANTED--mapreduce-1690.v2.patch
>
>
> When a reduce task launches, it starts several MapOutputCopier
> threads to download the output of finished maps; each thread is a running
> MapOutputCopier instance. Every time a thread copies map output from a
> remote host to the local node, it decides whether to shuffle the data into
> memory or to disk; the decision depends on the size of the map output and
> on the ShuffleRamManager configuration loaded from the client's
> hadoop-site.xml or the JobConf. Whenever the reduce task decides to shuffle
> the map output into memory, the MapOutputCopier connects to the remote map
> host, reads the map output from the socket, and copies it into an in-memory
> buffer, and every time that buffer comes from
> "byte[] shuffleData = new byte[mapOutputLength];". Here is where the
> problem begins. In our cluster there are jobs that process a huge amount of
> input data, say 110TB, so their reduce tasks shuffle a lot of data, some to
> disk and some in memory. Even so, a great deal of data is shuffled in
> memory, and each time the MapOutputCopier threads "new" a fresh buffer from
> the reduce heap. For a long-running job over huge data, this easily fills
> the reduce task's heap, drives the reduce task to OOM, and then exhausts
> the memory of the TaskTracker machine.
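> For illustration, the in-memory path described above boils down to
> the following pattern (a simplified sketch; the method name and stream
> handling here are assumptions, the real logic lives inside ReduceTask's
> MapOutputCopier):
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
>
> class InMemoryFetchSketch {
>     // One fresh heap allocation per in-memory map output, never pooled:
>     static byte[] fetch(InputStream in, int mapOutputLength) throws IOException {
>         byte[] shuffleData = new byte[mapOutputLength];
>         int read = 0;
>         while (read < mapOutputLength) {
>             int n = in.read(shuffleData, read, mapOutputLength - read);
>             if (n < 0) throw new IOException("premature end of map output");
>             read += n;
>         }
>         return shuffleData;  // lives on the reduce heap until merged or spilled
>     }
> }
> {code}
> With many copier threads repeating this pattern for a huge job, the
> allocations alone can fill the reduce heap, which is exactly the failure
> described above.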
> Here is our solution: change the code path the MapOutputCopier
> threads take when they shuffle map output in memory, using a BuddySystem
> similar to the one the Linux kernel uses to allocate and deallocate memory
> pages. When the reduce task launches, hand some memory, say 128MB, to this
> BuddySystem; every time the reduce task wants to shuffle map output in
> memory, it requests a buffer from the BuddySystem. If the BuddySystem has
> enough memory, the buffer is used directly; if not, the MapOutputCopier
> threads wait(), just as they do in the current Hadoop shuffle code. This
> lowers the reduce task's memory usage and greatly eases memory pressure on
> the TaskTracker. In our cluster, this BuddySystem made the situation of
> "losing a batch of TaskTrackers to memory overuse while huge jobs run"
> disappear, and therefore made the cluster more stable. A sketch of the
> allocator follows.
--
This message was sent by Atlassian JIRA
(v6.2#6252)