This is a fairly atypical node for Hadoop clusters. Eight cores is not that uncommon, but nodes normally have only two disks. The rationale for a small number of spindles per machine is that total disk bandwidth scales with the number of buses and controllers rather than with the number of spindles; the same goes for aggregate network bandwidth. The way to increase total bandwidth is therefore to increase the number of independent front-side buses, which in practice means increasing the number of machines.
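To make that arithmetic concrete, here is a back-of-envelope sketch. The per-spindle and controller throughput numbers are assumptions for illustration, not measurements of any particular hardware:

  # Crude model: per-node disk bandwidth is capped by the controller/bus,
  # so piling spindles into one box saturates quickly, while adding boxes
  # raises the cap linearly. All numbers are assumed, not measured.
  DISK_MBPS = 80      # assumed sequential throughput per spindle
  BUS_CAP_MBPS = 300  # assumed per-node controller/bus ceiling

  def node_disk_bw(disks):
      return min(disks * DISK_MBPS, BUS_CAP_MBPS)

  print(node_disk_bw(24))      # one 24-disk box: 300 MB/s usable
  print(12 * node_disk_bw(2))  # twelve 2-disk boxes: 1920 MB/s aggregate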
For a typical cluster of 10-100 nodes with replication = 3, you should get 80-95% data-local map tasks (see the rough sketch after the quoted message below). Experiments with small clusters like this are very sensitive to oddities such as the one you noticed. Moreover, with a decent switch topology, you should get very high aggregate usable bandwidth in a cluster like that.

If you want to see how scaling will work in terms of data locality, try using a much larger number of virtual machines. That will give you a test bed for seeing how many map tasks end up data-local. It won't be an effective way to actually scale your workload, just an interesting experiment that lets you see a larger Hadoop cluster in action.

On Fri, Jul 17, 2009 at 4:59 PM, Seunghwa Kang <[email protected]> wrote:
> My single node has 4 dual core cpus, and 24 hard disks, I need to get
> input data for 4 dual core cpus via 1 GigE network interface and seems
> like this hurts the scalability.

--
Ted Dunning, CTO
DeepDyve
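The locality sketch referenced above: a toy Monte Carlo model (my own illustration, not anything from Hadoop itself) that places each block's replicas on random nodes and lets a greedy scheduler prefer local tasks. It tends to land somewhat above the 80-95% quoted in the reply because it ignores per-node slot limits and data skew:

  import random

  def simulate(nodes=50, replication=3, blocks=500):
      # Place each block's replicas on `replication` distinct random nodes.
      replicas = [set(random.sample(range(nodes), replication))
                  for _ in range(blocks)]
      pending = set(range(blocks))
      local = 0
      while pending:
          node = random.randrange(nodes)  # a map slot frees on a random node
          here = [b for b in pending if node in replicas[b]]
          if here:                        # prefer a task with a local replica
              pending.remove(here[0])
              local += 1
          else:                           # otherwise run some non-local task
              pending.remove(next(iter(pending)))
      return local / blocks

  print(simulate())  # roughly 0.9 or higher in this toy model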
