[ https://issues.apache.org/jira/browse/HADOOP-249 ]
Devaraj Das updated HADOOP-249:
-------------------------------

    Release Note:
Jobs can enable task JVMs to be reused via the job config mapred.job.reuse.jvm.num.tasks. If this is 1 (the default), JVMs are not reused (one task per JVM). If it is -1, there is no limit to the number of tasks (of the same job) a JVM can run. Values greater than 1 can also be specified. A corresponding JobConf API has also been added: setNumTasksToExecutePerJvm.

> Improving Map -> Reduce performance and Task JVM reuse
> ------------------------------------------------------
>
>                 Key: HADOOP-249
>                 URL: https://issues.apache.org/jira/browse/HADOOP-249
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.3.0
>            Reporter: Benjamin Reed
>            Assignee: Devaraj Das
>             Fix For: 0.19.0
>
>         Attachments: 249-3.patch, 249-after-review.patch, 249-final.patch, 249-with-jvmID.patch, 249.1.patch, 249.2.patch, disk_zoom.patch, image001.png, task_zoom.patch
>
>
> These patches are really just to make Hadoop start trotting. It is still at least an order of magnitude slower than it should be, but I think these patches are a good start.
> I've created two patches for clarity. They are not independent, but they could easily be made so.
> The disk-zoom patch is a performance trifecta: less disk IO, less disk space, less CPU, and overall a tremendous improvement. The patch is based on the following observation: every piece of data from a map hits the disk once on the mapper, and three times (plus sorting) on the reducer. Further, the entire input for the reduce step is sorted together, maximizing the sort time. This patch causes:
> 1) the mapper to sort the relatively small fragments locally, which still causes two hits to the disk, but on smaller files;
> 2) the reducer to copy the map output and possibly merge it (if more than 100 outputs are present) with a couple of other outputs at copy time. No sorting is done, since the map outputs are already sorted;
> 3) the reducer to merge the map outputs on the fly in memory at reduce time.
> I'm attaching the performance graph (with just the disk-zoom patch) to show the results. This benchmark uses a random input and a null output to remove any DFS performance influences. The cluster of 49 machines I was running on had limited disk space, so I was only able to run up to a certain size on unmodified Hadoop. With the patch we use 1/3 the disk space.
> The second patch allows the task tracker to reuse processes to avoid the overhead of starting the JVM. While JVM startup is relatively fast, restarting a Task causes disk IO and DFS operations that have a negative impact on the rest of the system. When a Task finishes, rather than exiting, it reads the next task to run from stdin. We still isolate the Task runtime from the TaskTracker, but we only pay the startup penalty once.
> This second patch also fixes two performance issues not related to JVM reuse. (The reuse just makes the problems glaring.) First, the JobTracker counts all jobs, not just the running jobs, to decide the load on a tracker. Second, the TaskTracker should really ask for a new Task as soon as one finishes rather than waiting the 10 seconds.
> I've been benchmarking the code a lot, but I don't have access to a really good cluster to try the code out on, so please treat it as experimental. I would love feedback.
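
A minimal sketch of the reuse loop described above: instead of exiting after one task, the child JVM keeps reading the next task to run from stdin until the TaskTracker closes the pipe. The class and helper names (TaskChildLoop, runTask) are hypothetical stand-ins, not the actual patch code.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class TaskChildLoop {
        public static void main(String[] args) throws IOException {
            // The TaskTracker launches this child once; instead of exiting
            // after one task, the child keeps reading task IDs from stdin.
            BufferedReader stdin =
                    new BufferedReader(new InputStreamReader(System.in));
            String taskId;
            // EOF on the pipe means the TaskTracker has no more tasks for us.
            while ((taskId = stdin.readLine()) != null) {
                runTask(taskId);
            }
        }

        // Hypothetical stand-in: the real patch would look up the task
        // definition, run the map or reduce code, and report status back.
        private static void runTask(String taskId) {
            System.err.println("running task " + taskId);
        }
    }

The point of the design is that the TaskTracker still gets process isolation from the task code, but pays the JVM startup cost only once per sequence of tasks.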
> There is another obvious thing to change: ReduceTasks should start after the first batch of MapTasks completes, so that 1) they have something to do, and 2) they are running on the fastest machines.
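
For reference, a minimal driver sketch of the release-note API above: setNumTasksToExecutePerJvm and mapred.job.reuse.jvm.num.tasks come from the release note, while the class name, job name, and the rest of the driver boilerplate are assumed for illustration.

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JvmReuseDriver.class);
            conf.setJobName("jvm-reuse-example"); // assumed job name

            // Let each JVM run up to 10 tasks of this job; the two lines
            // below are equivalent ways of setting the same property.
            conf.setNumTasksToExecutePerJvm(10);
            // conf.setInt("mapred.job.reuse.jvm.num.tasks", 10);

            // Per the release note: -1 removes the limit entirely, and
            // 1 (the default) keeps the old one-task-per-JVM behavior.

            JobClient.runJob(conf); // input/output/mapper setup omitted
        }
    }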