Thorsten Schuett wrote:
> In my case, it looks as if the loopback device is the bottleneck. So increasing the number of tasks won't help.
Hmm. I have trouble believing that the loopback device is actually the bottleneck. What makes you think that it is?
To better support standalone use of Hadoop on multicore boxes, perhaps we should promote the MiniMR cluster code from test into the core. This runs the tasktracker and jobtracker in the same process. It still forks processes for tasks, and has all the features of a grid setup: web UI, task restarting, etc.
I don't think we should spend much effort adding multi-threading to LocalRunner, since it lacks so many of the other features of TaskTracker/JobTracker. We should also avoid re-implementing those features. Thus running TaskTracker and JobTracker in the same JVM seems like a good strategy for multicore support.
If performance with a MiniMR cluster is not good, then we should determine why. We could, e.g., benchmark and profile sort performance in this configuration. Again, I have a hard time believing that loopback bandwidth is a bottleneck. If it is, then perhaps we can optimize around it, but let's first be sure that's the case.
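One rough way to sanity-check the loopback claim is to measure raw TCP throughput over the loopback interface and compare it with the rate at which the sort actually moves data. Below is a minimal sketch in plain Java, not Hadoop code; the class name LoopbackBench, the 64 KB buffers, and the 256 MB transfer size are all illustrative assumptions. It only measures a single stream, so it gives a reference number rather than a profile of the real job.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical micro-benchmark: pushes bytes through one loopback TCP connection.
public class LoopbackBench {

    /** Sends totalBytes over a loopback connection and returns the number of bytes received. */
    public static long transfer(long totalBytes) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        AtomicLong received = new AtomicLong();

        // Port 0 asks the OS for any free port; bind to loopback only.
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread reader = new Thread(() -> {
                byte[] buf = new byte[64 * 1024];
                try (Socket s = server.accept(); InputStream in = s.getInputStream()) {
                    int n;
                    // Drain the stream until the writer closes its end.
                    while ((n = in.read(buf)) != -1) {
                        received.addAndGet(n);
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            reader.start();

            byte[] chunk = new byte[64 * 1024];
            try (Socket s = new Socket(loopback, server.getLocalPort());
                 OutputStream out = s.getOutputStream()) {
                long remaining = totalBytes;
                while (remaining > 0) {
                    int n = (int) Math.min(chunk.length, remaining);
                    out.write(chunk, 0, n);
                    remaining -= n;
                }
            }
            // Closing the client socket signals end-of-stream to the reader.
            reader.join();
        }
        return received.get();
    }

    public static void main(String[] args) throws Exception {
        long totalBytes = 256L * 1024 * 1024;
        transfer(totalBytes); // warm-up pass so JIT compilation does not skew the timing

        long start = System.nanoTime();
        long got = transfer(totalBytes);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.2f s = %.1f MB/s%n", got, seconds, got / 1e6 / seconds);
    }
}
```

If the figure this reports is far above the aggregate rate the sort achieves, the loopback device is probably not the limiting factor, and profiling the tasks themselves would be the next step.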
Note that, when running standalone, even with TaskTracker and JobTracker, one need not use HDFS. Direct access to the local filesystem will probably be considerably faster.
Doug
