Bill Au wrote:
Is Hadoop designed to run on homogeneous hardware only, or does it work just
as well on heterogeneous hardware?  If the datanodes have different
disk capacities, does HDFS still spread the data blocks equally among all
the datanodes, or will the datanodes with higher disk capacity end up storing
more data blocks?  Similarly, if the tasktrackers have different numbers of
CPUs, is there a way to configure Hadoop to run more tasks on those
tasktrackers that have more CPUs?  Is that simply a matter of setting
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum differently on the tasktrackers?

Bill


Life is simpler on homogeneous boxes. By setting the maximum tasks differently on the different machines, you do limit the amount of work that gets pushed out to those boxes. Slower CPUs/HDDs are more troublesome: they aren't picked up directly, though speculative execution can handle some of this.
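
For example, on a box with more cores you might raise the per-node limits in its local mapred-site.xml (hadoop-site.xml on older releases). A minimal sketch; the values of 6 and 4 are only illustrative (the shipped default is 2 for each):

  <configuration>
    <!-- allow more concurrent map slots on this beefier node -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>
    </property>
    <!-- and more concurrent reduce slots -->
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
  </configuration>

Since these are read by each tasktracker at startup, you'd keep a different copy of the file on each class of machine and restart the tasktracker after changing it.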

One interesting bit of research would be something adaptive: something to monitor throughput and tune those values based on performance. That would detect variations in a cluster and work with it, rather than requiring you to know the capabilities of every machine.

-steve
