On Wednesday 22 August 2007, Doug Cutting wrote:
> Thorsten Schuett wrote:
> > In my case, it looks as if the loopback device is the bottleneck. So
> > increasing the number of tasks won't help.
>
> Hmm. I have trouble believing that the loopback device is actually the
> bottleneck. What makes you think that it is?

During the copy phase of reduce, the cpu load was very low and vmstat showed
constant reads from the disk at ~15MB/s and bursty writes. At the same time,
data was sent over the loopback device at ~15MB/s. I don't see what else could
limit the performance here. The disk can certainly provide the data at higher
speeds.
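
One way to check whether the loopback device itself caps out near 15MB/s would be to measure its raw TCP throughput outside Hadoop. The following is only a rough sketch, not part of Hadoop: the function name `measure_loopback`, the 64MB transfer size and the 64KB chunk size are assumptions chosen for illustration. If the number it reports is far above 15MB/s, the limit is more likely somewhere else in the copy path.

```python
import socket
import threading
import time


def measure_loopback(total_bytes=64 * 1024 * 1024, chunk_size=64 * 1024):
    """Push total_bytes through a TCP connection on 127.0.0.1.

    Returns (bytes_received, megabytes_per_second). Hypothetical helper,
    for illustration only.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = [0]

    def sink():
        # Accept one connection and discard everything that arrives.
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk_size)
                if not data:
                    break
                received[0] += len(data)

    reader = threading.Thread(target=sink)
    reader.start()

    payload = b"\0" * chunk_size
    start = time.perf_counter()
    with socket.create_connection(("127.0.0.1", port)) as client:
        sent = 0
        while sent < total_bytes:
            client.sendall(payload)
            sent += chunk_size
    reader.join()  # wait until the receiver has drained the socket
    elapsed = time.perf_counter() - start
    server.close()
    return received[0], received[0] / elapsed / 1e6


if __name__ == "__main__":
    n, rate = measure_loopback()
    print("received %d bytes at %.1f MB/s over loopback" % (n, rate))
```

This only exercises the kernel's loopback path with an in-memory payload, so it gives an upper bound; it says nothing about the disk reads and writes that run alongside the copy phase.
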

I'll be happy to repeat my experiments with the MiniMR code, but I need a
pointer on how to proceed and where to start.

Thorsten

> To better support standalone use of Hadoop on multicore boxes, perhaps
> we should promote the MiniMR cluster code from test into the core. This
> runs the tasktracker and jobtracker in the same process. It still forks
> processes for tasks, and has all the features of a grid setup: web ui,
> task restarting, etc.
>
> I don't think we should spend much effort adding multi-threading to
> LocalRunner, since it lacks so many of the other features of
> TaskTracker/JobTracker. We should also avoid re-implementing those
> features. Thus running TaskTracker and JobTracker in the same JVM seems
> like a good strategy for multicore support.
>
> If performance with a MiniMR cluster is not good, then we should
> determine why. We could, e.g., benchmark and profile sort performance
> in this configuration. Again, I have a hard time believing that
> loopback bandwidth is a bottleneck. If it is, then perhaps we can
> optimize around it, but let's first be sure that's the case.
>
> Note that, when running standalone, even with TaskTracker and
> JobTracker, one need not use HDFS. Direct access to the local
> filesystem will probably be considerably faster.
>
> Doug
