Bill Au wrote:
Is hadoop designed to run on homogeneous hardware only, or does it work just
as well on heterogeneous hardware as well? If the datanodes have different
disk capacities, does HDFS still spread the data blocks equally among all
the datanodes, or will the datanodes with higher disk capacity end up storing
more data blocks? Similarly, if the tasktrackers have different numbers of
CPUs, is there a way to configure hadoop to run more tasks on those
tasktrackers that have more CPUs? Is that simply a matter of setting
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum differently on the tasktrackers?
Bill
Life is simpler on homogeneous boxes; by setting the maximum tasks
differently for the different machines, you do limit the amount of work
that gets pushed out to those boxes. More troublesome are slower
CPUs/HDDs; those aren't picked up directly, though speculative execution
can handle some of this.
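
Yes, setting those two properties per machine is the right pair of knobs.
As a rough sketch (the slot counts below are made up for illustration,
not recommendations), the mapred-site.xml on a node with more cores might
contain:

  <configuration>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <!-- illustrative value: roughly one map slot per core on an 8-core box -->
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <!-- illustrative value: fewer reduce slots than map slots is typical -->
      <value>4</value>
    </property>
  </configuration>

Each tasktracker reads these from its own local config, so the values can
differ from machine to machine without any central coordination.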
One interesting bit of research would be something adaptive: something
to monitor throughput and tune those values based on performance. That
would detect variations in a cluster and work with it, rather than
requiring you to know the capabilities of every machine.
-steve