On 14/09/11 22:20, Ted Dunning wrote:
> This makes a bit of sense, but you have to worry about the inertia of the
> data.  Adding compute resources is easy.  Adding data resources, not so
> much.

I've done it. Like Ted says, pure compute nodes generate more network traffic on both reads and writes, and if you bring up DataNodes you then have to leave them around. The strength is that the infrastructure provider can sell them to you at a lower $/hour on the condition that they can take them back when demand gets high; these compute-only nodes would be infrastructure-pre-emptible. Look for the presentation "Farming Hadoop in the Cloud" for more details, though I don't discuss pre-emption or infrastructure specifics there.
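To put rough numbers on the traffic point: with the default replication of 3, a data-local task on a DataNode reads its input off local disk, and a block it writes puts two copies on the wire (the first replica lands locally, the pipeline forwards the other two). The same task on a compute-only node pulls every input byte over the network, and a write puts all three copies on the wire. Roughly, reads go from near-zero to one network copy per byte and writes from two copies to three, assuming the scheduler manages data-local placement in the DataNode case.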


> if the computation is not near the data, then it is likely to be
> much less effective.

Which implies your infrastructure needs to be data-aware and know to bring up the new VMs on the same racks as the HDFS nodes.
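
On the Hadoop side that can be as little as a topology mapping which asks the cloud where each VM is hosted. A rough, untested sketch, where CloudInventory stands in for whatever inventory API your infrastructure actually exposes, wired in via topology.node.switch.mapping.impl in place of the default script-based mapping:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;

/**
 * Resolves hostnames to rack paths by asking the cloud infrastructure
 * which physical rack each VM sits on, so HDFS and the JobTracker treat
 * the compute-only VMs as rack-local to the DataNodes.
 */
public class CloudAwareTopology implements DNSToSwitchMapping {

  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String host : names) {
      // Hypothetical call into the provider's placement/inventory service.
      String rack = CloudInventory.rackOf(host);   // e.g. "/dc1/rack12"
      racks.add(rack != null ? rack : "/default-rack");
    }
    return racks;
  }
}

The default ScriptBasedMapping with a topology script that shells out to the provider's tooling gets you the same effect without writing any Java.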

There are some infrastructure-aware runtimes (I am thinking of the Technical University of Berlin's Stratosphere project) which take a job described at a higher level than MR commands (more like Pig or Hive) and come up with an execution plan that can schedule for optimal acquisition and re-use of VMs, knowing the cost of the machines, the hysteresis of VM setup/teardown, and the hourly billing. You can then impose policies like "fast execution" or "lowest cost" and have different plans created and executed.
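
To make that concrete, here is a toy of the kind of decision such a planner prices up before acquiring anything. It has nothing to do with Stratosphere's actual code; the plans, the $/hour rate and the per-VM setup overhead are all invented:

/** Toy cost model for choosing between candidate execution plans. */
public class PlanChooser {

  /** A candidate plan: how many VMs it acquires and how long it is expected to run. */
  static class Plan {
    final String name;
    final int vms;
    final double runtimeHours;
    Plan(String name, int vms, double runtimeHours) {
      this.name = name; this.vms = vms; this.runtimeHours = runtimeHours;
    }
    /** Billed cost: hourly rate rounded up to whole hours, plus per-VM setup/teardown overhead. */
    double dollars(double hourlyRate, double setupCostPerVm) {
      return vms * (Math.ceil(runtimeHours) * hourlyRate + setupCostPerVm);
    }
  }

  public static void main(String[] args) {
    double rate = 0.50;    // assumed $/VM-hour
    double setup = 0.10;   // assumed per-VM setup/teardown cost (the hysteresis)
    Plan wide   = new Plan("fast execution", 40, 1.2);  // more VMs, shorter wall clock
    Plan narrow = new Plan("lowest cost",     8, 5.5);  // fewer VMs, longer wall clock

    for (Plan p : new Plan[] { wide, narrow }) {
      System.out.printf("%-15s %2d VMs, %.1f h -> $%.2f%n",
          p.name, p.vms, p.runtimeHours, p.dollars(rate, setup));
    }
  }
}

With those made-up numbers the wide plan bills $44.00 and the narrow one $24.80, so a "fast execution" policy and a "lowest cost" policy pick different plans for the same job; the planner's job is to generate and price those candidates before any VM is acquired.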

This is all PhD-grade research by a group of highly skilled postgrads led by a professor who has worked on VLDBs; I would not attempt to retrofit this into Hadoop on my own. That said, if you want a PhD, you could contact that team or the UC Irvine people working on Algebricks and Hyracks and convince them that you should join their teams.

-steve
