[ http://issues.apache.org/jira/browse/HADOOP-249?page=all ]
Sameer Paranjpye updated HADOOP-249:
------------------------------------
Fix Version: 0.4
> Improving Map -> Reduce performance and Task JVM reuse
> ------------------------------------------------------
>
> Key: HADOOP-249
> URL: http://issues.apache.org/jira/browse/HADOOP-249
> Project: Hadoop
> Type: Improvement
> Versions: 0.3
> Reporter: Benjamin Reed
> Fix For: 0.4
> Attachments: disk_zoom.patch, image001.png, task_zoom.patch
>
> These patches are really just to make Hadoop start trotting. It is still at
> least an order of magnitude slower than it should be, but I think these
> patches are a good start.
> I've created two patches for clarity. They are not independent, but could
> easily be made so.
> The disk-zoom patch is a performance trifecta: less disk IO, less disk space,
> less CPU, and overall a tremendous improvement. The patch is based on the
> following observation: every piece of data from a map hits the disk once on
> the mapper, and 3 (+plus sorting) times on the reducer. Further, the entire
> input for the reduce step is sorted together maximizing the sort time. This
> patch causes:
> 1) the mapper to sort the relatively small fragments at the mapper which
> causes two hits to the disk, but they are smaller files.
> 2) the reducer copies the map output and may merge (if more than 100 outputs
> are present) with a couple of other outputs at copy time. No sorting is done
> since the map outputs are sorted.
> 3) the reducer will merge the map outputs on the fly in memory at reduce
> time.
> I'm attaching the performance graph (with just the disk-zoom patch) to show
> the results. This benchmark uses a random input and null output to remove any
> DFS performance influences. The cluster of 49 machines I was running on had
> limited disk space, so I was only able to run to a certain size on unmodified
> Hadoop. With the patch we use 1/3 the amount of disk space.
> The second patch allows the task tracker to reuse processes to avoid the
> over-head of starting the JVM. While JVM startup is relatively fast,
> restarting a Task causes disk IO and DFS operations that have a negative
> impact on the rest of the system. When a Task finishes, rather than exiting,
> it reads the next task to run from stdin. We still isolate the Task runtime
> from TaskTracker, but we only pay the startup penalty once.
> This second patch also fixes two performance issues not related to JVM reuse.
> (The reuse just makes the problems glaring.) First, the JobTracker counts all
> jobs not just the running jobs to decide the load on a tracker. Second, the
> TaskTracker should really ask for a new Task as soon as one finishes rather
> than wait the 10 secs.
> I've been benchmarking the code alot, but I don't have access to a really
> good cluster to try the code out on, so please treat it as experimental. I
> would love to feedback.
> There is another obvious thing to change: ReduceTasks should start after the
> first batch of MapTasks complete, so that 1) they have something to do, and
> 2) they are running on the fastest machines.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira