This is an atypical node for Hadoop clusters.  8 cores is not that
uncommon, but nodes normally have only 2 disks.  The rationale for a small
number of spindles per machine is that total disk bandwidth scales with the
number of buses and controllers rather than the number of spindles.  The
same goes for aggregate network bandwidth.  Thus the way to increase total
bandwidth is to increase the number of independent buses, controllers, and
network interfaces, which is done by adding machines rather than disks.
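Here's a back-of-envelope sketch of that argument.  All of the throughput
numbers are rough assumptions I made up for illustration, not measurements:

DISK_MB_S = 80    # assumed per-spindle sequential throughput (MB/s)
BUS_MB_S = 300    # assumed usable bandwidth per disk controller/bus (MB/s)
NIC_MB_S = 110    # assumed usable GigE bandwidth per node (MB/s)

def cluster_io(nodes, disks_per_node, controllers_per_node):
    # Per-node disk bandwidth is capped by the buses and controllers,
    # not by the raw spindle count.
    per_node_disk = min(disks_per_node * DISK_MB_S,
                        controllers_per_node * BUS_MB_S)
    return nodes * per_node_disk, nodes * NIC_MB_S

# One 24-disk node vs. twelve 2-disk nodes (same 24 spindles total):
print(cluster_io(1, 24, 2))    # -> (600, 110)   MB/s disk, MB/s net
print(cluster_io(12, 2, 1))    # -> (1920, 1320) MB/s disk, MB/s net

Same 24 spindles either way, but the thin configuration gets roughly 3x the
aggregate disk bandwidth and 12x the network bandwidth, simply because it
has 12x the buses, controllers, and NICs.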

For a typical cluster of 10-100 nodes with replication = 3, you should get
80-95% data-local map tasks.  Experiments with small clusters like this are
very sensitive to oddities such as the one you noticed.  Moreover, with a
decent switch topology, you should get very high aggregate usable bandwidth
in a cluster like that.
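A crude toy simulation makes the point.  This is my own simplified model of
a locality-aware scheduler, not the actual JobTracker logic: each map task
runs on a node holding one of its block's replicas if such a node still has
a free slot, and remotely otherwise:

import random

def simulate_locality(nodes=50, blocks=2000, replication=3, seed=0):
    rng = random.Random(seed)
    # Give every node an equal share of map slots across all waves.
    slots = [-(-blocks // nodes)] * nodes
    # Each block's replicas land on `replication` distinct random nodes.
    tasks = [rng.sample(range(nodes), replication) for _ in range(blocks)]
    rng.shuffle(tasks)
    local = 0
    for replicas in tasks:
        holder = next((n for n in replicas if slots[n] > 0), None)
        if holder is not None:
            slots[holder] -= 1   # ran where a replica lives: data-local
            local += 1
        else:
            # All replica holders are full; read the block over the network.
            fallback = max(range(nodes), key=lambda i: slots[i])
            slots[fallback] -= 1
    return local / blocks

for n in (10, 50, 100):
    print(n, "nodes:", round(100 * simulate_locality(nodes=n)), "% data-local")

With three replicas to choose from, the scheduler almost always finds a
free slot next to the data; the misses come from the tail of the job, which
is part of why small clusters and small jobs are noisier.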

If you want to see how scaling will work in terms of data locality, try
using a much larger number of virtual machines.  That will give you a test
bed for seeing how many map tasks are data-local.  It won't be an effective
way to actually scale your workload, just an interesting experiment that
lets you see a larger Hadoop cluster in action.
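When you run that experiment, the job client prints locality counters at
job completion.  Something like this will pull the percentage out of a
saved run log (the counter labels below match the stock output I've seen;
adjust the patterns if your version names them differently):

import re
import sys

PATTERNS = {
    "data_local": re.compile(r"Data-local map tasks\s*=?\s*(\d+)"),
    "rack_local": re.compile(r"Rack-local map tasks\s*=?\s*(\d+)"),
    "launched": re.compile(r"Launched map tasks\s*=?\s*(\d+)"),
}

def locality_report(log_text):
    # Grab each counter if it appears in the captured job output.
    found = {}
    for key, pat in PATTERNS.items():
        m = pat.search(log_text)
        if m:
            found[key] = int(m.group(1))
    if found.get("launched"):
        found["percent_local"] = round(
            100.0 * found.get("data_local", 0) / found["launched"], 1)
    return found

if __name__ == "__main__":
    print(locality_report(sys.stdin.read()))

Just pipe the console output of your job run into it and compare the
percentage as you vary the number of (virtual) nodes.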

On Fri, Jul 17, 2009 at 4:59 PM, Seunghwa Kang <[email protected]> wrote:

> My single node has 4 dual-core CPUs and 24 hard disks.  I need to get
> input data for the 4 dual-core CPUs via a single GigE network interface,
> and it seems like this hurts scalability.
>



-- 
Ted Dunning, CTO
DeepDyve
