Re: Reduce Performance

Thorsten Schuett Thu, 23 Aug 2007 06:26:25 -0700

On Wednesday 22 August 2007, Doug Cutting wrote:
> Thorsten Schuett wrote:
> > In my case, it looks as if the loopback device is the bottleneck. So
> > increasing the number of tasks won't help.
>
> Hmm.  I have trouble believing that the loopback device is actually the
> bottleneck.  What makes you think that it is?
During the copy phase of reduce, the CPU load was very low, and vmstat showed
constant reads from the disk at ~15MB/s with bursty writes. At the same time,
data was being sent over the loopback device at ~15MB/s. I don't see what else
could be limiting performance here; the disk can certainly deliver data at
higher rates.
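
One way to test the loopback hypothesis in isolation would be to measure raw
TCP throughput over the loopback interface on the same machine and compare it
with the ~15MB/s observed above. The following is only a minimal sketch: the
class name LoopbackThroughput and the method measureMBPerSec are made up for
illustration, it is not Hadoop code, and it does not go through Hadoop's own
copy path, so it gives at best an upper bound on what the loopback device can
carry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper, not part of Hadoop: pushes bytes through a loopback
// TCP connection and reports the rate seen by the receiving side.
class LoopbackThroughput {

    // Sends totalBytes over loopback and returns the observed rate in MB/s.
    static double measureMBPerSec(final long totalBytes)
            throws IOException, InterruptedException {
        final InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            final int port = server.getLocalPort(); // ephemeral port
            final IOException[] senderError = new IOException[1];

            // Sender thread: writes totalBytes in 64 KB chunks, then closes.
            Thread sender = new Thread(() -> {
                byte[] buf = new byte[64 * 1024];
                try (Socket s = new Socket(loopback, port);
                     OutputStream out = s.getOutputStream()) {
                    long sent = 0;
                    while (sent < totalBytes) {
                        int n = (int) Math.min(buf.length, totalBytes - sent);
                        out.write(buf, 0, n);
                        sent += n;
                    }
                } catch (IOException e) {
                    senderError[0] = e;
                }
            });

            // Timing starts just before the sender is launched, so it
            // includes the (small) connection setup cost.
            long start = System.nanoTime();
            sender.start();

            // Receiver: reads until the sender closes the connection.
            long received = 0;
            try (Socket s = server.accept();
                 InputStream in = s.getInputStream()) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    received += n;
                }
            }
            long elapsedNanos = Math.max(1L, System.nanoTime() - start);

            sender.join();
            if (senderError[0] != null) {
                throw senderError[0];
            }
            if (received != totalBytes) {
                throw new IOException("expected " + totalBytes
                        + " bytes, received " + received);
            }
            return (received / (1024.0 * 1024.0)) / (elapsedNanos / 1e9);
        }
    }

    public static void main(String[] args) throws Exception {
        long bytes = 256L * 1024 * 1024; // 256 MB; adjust as needed
        System.out.printf("loopback: %.1f MB/s%n", measureMBPerSec(bytes));
    }
}
```

If the figure this reports is far above 15MB/s, the loopback device itself is
unlikely to be the limit, and the cause would have to be sought elsewhere in
the copy phase.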


I'd be happy to repeat my experiments with the MiniMR code, but I need a
pointer on how to proceed and where to start.

Thorsten

> To better support standalone use of Hadoop on multicore boxes, perhaps
> we should promote the MiniMR cluster code from test into the core.  This
> runs the tasktracker and jobtracker in the same process.  It still forks
> processes for tasks, and has all the features of a grid setup: web ui,
> task restarting, etc.
>
> I don't think we should spend much effort adding multi-threading to
> LocalRunner, since it lacks so many of the other features of
> TaskTracker/JobTracker.  We should also avoid re-implementing those
> features.  Thus running TaskTracker and JobTracker in the same JVM seems
> like a good strategy for multicore support.
>
> If performance with a MiniMR cluster is not good, then we should
> determine why.  We could, e.g., benchmark and profile sort performance
> in this configuration.  Again, I have a hard time believing that
> loopback bandwidth is a bottleneck.  If it is, then perhaps we can
> optimize around it, but let's first be sure that's the case.
>
> Note that, when running standalone, even with TaskTracker and
> JobTracker, one need not use HDFS.  Direct access to the local
> filesystem will probably be considerably faster.
>
> Doug
