On 21/01/11 09:20, Evert Lammerts wrote:
Even with the performance hit, there are still benefits to running Hadoop
this way:

 - as you only consume/pay for the CPU time you use, if you are only
   running batch jobs it's lower cost than having a Hadoop cluster that
   is under-used.

 - if your data is stored in the cloud infrastructure, then you need to
   data-mine it in VMs, unless you want to take the time and money hit
   of moving it out, and have somewhere else to store it.

 - if the infrastructure lets you, you can lock down the cluster so it
   is secure.

Where a physical cluster is good is as a very low-cost way of storing
data, provided you can analyse it with Hadoop, and provided you can keep
that cluster busy most of the time, either with Hadoop work or other
scheduled work. If your cluster is sitting idle, you are still paying the
capital and (reduced) electricity costs, spread over less useful work, so
the effective cost of the storage and of whatever compute you do run goes
up.

Agreed, but this has little to do with Hadoop as middleware and more to do
with the benefits of virtualized vs. physical infrastructure. I agree that
it is convenient to use HDFS as a DFS to keep your data local to your VMs,
but you could choose other DFSs as well.

We don't use HDFS; we bring up VMs close to where the data persists.

http://www.slideshare.net/steve_l/high-availability-hadoop


The major benefit of Hadoop is its data-locality principle, and this is
what you give up when you move to the cloud. Regardless of whether you
store your data within your VM or on a NAS, it *will* have to travel over
the wire. As soon as that happens you lose the benefit of data locality
and are left with MapReduce as just a model for parallel computing. In
that case you could use less restrictive software, maybe PBS. You could
even install HOD on your virtual cluster, if you'd like to keep the option
of MapReduce.

We don't suffer locality hits so much, but you do pay the extra
infrastructure costs of a more agile datacentre, and if you go for
redundancy in hardware over replication, you have fewer places to run
your code.

Even on EC2, which doesn't let you tell its VM placer which datasets you
want to work with so it can factor that into its placement decisions, once
the data is in the datanodes you do get locality.
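
To make that concrete, here is a minimal sketch (not from the original
thread) that asks HDFS where each block of a file lives, via the standard
FileSystem API; the namenode URI and file path are hypothetical. The
per-block host list it prints is the placement information the scheduler
uses when it tries to run map tasks on, or near, the nodes holding the
data.

// Minimal sketch: list which datanodes hold each block of a file.
// The namenode URI and file path below are hypothetical.
import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    Path file = new Path("/data/events.log");   // hypothetical file
    FileStatus status = fs.getFileStatus(file);
    // One BlockLocation per block, listing the datanodes holding replicas.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset %d, length %d, hosts %s%n",
          block.getOffset(), block.getLength(),
          Arrays.toString(block.getHosts()));
    }
    fs.close();
  }
}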


Adarsh, there are probably published results from more generic benchmark
tools (Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those
should give you a better idea of the penalties of virtualization. (Our
experience with a number of technologies on our OpenNebula cloud is, as
Steve points out, that you mainly pay in disk I/O performance.)

- it would be interesting to see anything you can publish there...


I think a decision to go with either cloud or physical infrastructure
should be based on the frequency, intensity and types of computation you
expect in the short term (including operations dealing with the data), and
on how you think these parameters will develop over the mid to long term.
Then compare the price of a physical cluster that meets those demands
(make sure to include power and operations) against the investment you
would otherwise need to make in the cloud.
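
To make that comparison concrete, here is a rough back-of-the-envelope
sketch along those lines; every figure in it is a hypothetical placeholder,
not a quote or a measurement. It amortises the cluster's capital cost, adds
power and operations, and sets that against the cloud spend the expected
usage would generate.

// All numbers below are hypothetical placeholders for illustration only.
public class ClusterCostSketch {
  public static void main(String[] args) {
    // Physical cluster: capital amortised over its useful life, plus
    // ongoing power and operations.
    double capex = 200000.0;            // purchase price of the cluster
    double amortisationYears = 3.0;
    double powerPerYear = 15000.0;
    double opsPerYear = 40000.0;
    double physicalPerYear =
        capex / amortisationYears + powerPerYear + opsPerYear;

    // Cloud: pay only for the VM-hours you actually use, plus storage.
    double instanceHourRate = 0.50;     // cost per VM-hour
    double vmHours = 20 * 8 * 250;      // 20 VMs, 8 h/day, 250 days/year
    double storagePerYear = 10000.0;
    double cloudPerYear = instanceHourRate * vmHours + storagePerYear;

    System.out.printf("physical: %.0f/year, cloud: %.0f/year%n",
        physicalPerYear, cloudPerYear);
  }
}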

+1
