On 06/10/11 03:00, Jagane Sundar wrote:
Thanks for your input, Milind. It's very useful and interesting.

In the interest of brevity, I have truncated most of it except for the point
regarding 'cloud friendly'. I have done some research into this, and want to
get some more community feedback.

2. Built in support for the cloud. (Whirr is interesting. Ambari more so,
but both fall short.)

Not very sure. If by "support for the cloud" you mean the ability to provision
atop a hypervisor, add or remove instances, etc., I think there are
other approaches proven in the industry.



I've just started the wiki page on this topic: http://wiki.apache.org/hadoop/Virtual%20Hadoop


There are two aspects to cloud friendliness - deployment
technologies/automation, and storage.

- agility to handle the failure modes of cloud infrastructure
- security in a shared infrastructure
- flexibility based on demand

As far as deployment automation is concerned, I am eager to know what other
approaches you are familiar with. Chef/Puppet et al. are not interesting to
me. I want this to have end-user self-service characteristics, not
'end user files a ticket, sysadmin runs a [chef|puppet|other] script'.

I've done this with a web UI: ask for the number of machines, bring up the NN/JT/single-DN master node, and once that is up, bring up the workers with a config that includes the hostname of the master node.
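The two-phase ordering above (master first, then workers that learn the master's hostname) can be sketched as follows. This is a toy model: `provision_vm` is a hypothetical placeholder for whatever the infrastructure layer actually exposes, not a real API.

```python
# Illustrative sketch of the two-phase cluster bring-up described above.
# provision_vm() is a hypothetical stand-in for the infrastructure layer;
# in the toy it just returns a deterministic hostname and ignores config.

def provision_vm(role, config):
    """Pretend to start a VM with the given config; return its hostname."""
    return "%s-host" % role

def bring_up_cluster(num_workers):
    # Phase 1: master node running NameNode, JobTracker and a single DataNode.
    master = provision_vm("master", config={})
    # Phase 2: workers get a config containing the master's hostname,
    # which is only known once the master is actually up.
    workers = [
        provision_vm("worker%d" % i,
                     config={"fs.default.name": "hdfs://%s:8020" % master})
        for i in range(num_workers)
    ]
    return master, workers

master, workers = bring_up_cluster(3)
```

The point of the ordering is the dependency: the workers' config cannot be written until the master's address exists.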

I also deployed a web UI for long-haul job submission:
http://www.slideshare.net/steve_l/long-haul-hadoop

That was originally all deployed with the SmartFrog framework and a modified version of the Hadoop codebase for tighter integration. These days I just untar and reconfigure a 0.20.20x .tar.gz file on the target machines, patching in the late-binding information.
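The "patch in the late-binding information" step might look something like this. The rendering helper is made up for illustration; only the property names (`fs.default.name`, `mapred.job.tracker`) are the real 0.20-era Hadoop config keys.

```python
# Sketch of patching late-binding info (the master hostname) into a
# freshly untarred Hadoop config. The helper is hypothetical; the
# property names are the real 0.20-era keys.

CORE_SITE_TEMPLATE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{master}:8020</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>{master}:8021</value>
  </property>
</configuration>
"""

def render_config(master_hostname):
    # The hostname is only known at deploy time -- hence "late binding".
    return CORE_SITE_TEMPLATE.format(master=master_hostname)

conf = render_config("nn1.example.org")
```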


Storage is very interesting. My own thoughts, from analyzing EC2 and EMR, are
as follows. (A lot of the following is speculation and educated guesswork,
so I may be totally off, but here it is anyway):

Amazon's philosophy is totally 'bring up on demand when needed and tear down
when done'. I like this philosophy a lot. However, it does not work well for
storage: storage needs to be always up and available. Hence, they took
Hadoop, stripped out HDFS, and built a shim to S3, their object storage
service. There is no POSIX there. MapReduce jobs run in VMs that are
brought up on demand and access the S3-hosted files using the s3n protocol
(the 'n' stands for native - native to S3, not native to Hadoop). When this
turned out to be slow as sh**, they seem to have hacked the HDFS layer some
more, in order to have an actual NameNode for metadata while using S3 for
storing blocks; the protocol for accessing this is s3. Both of these
approaches have one severe failing: they do not support append and hflush.
Ergo, no HBase on EMR. I am sure they are working furiously to address this
shortcoming and add append/hflush support to s3n or s3, in order to make it
possible to run HBase on EMR. In the meantime, anecdotal evidence suggests
that at least half of Amazon's customers are opting to run Apache Hadoop on
EC2 VMs with EBS storage, completely bypassing the EMR offering.

More expensive, but more flexible in terms of what you can run
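One way to see why append and hflush are hard on an object store, as noted above: objects are written whole and are immutable, so an "append" has to read back and rewrite the entire object. A toy model of those put/get semantics (this is not the S3 API, just an illustration):

```python
# Toy model of an immutable object store, illustrating why append/hflush
# are awkward: there is no primitive to extend an object in place, so
# "append" degenerates into read-whole-object + rewrite-whole-object.

class ToyObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # Whole-object write: the only write primitive the store offers.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def exists(self, key):
        return key in self._objects

def append(store, key, more):
    # O(object size) per append: tolerable for bulk MapReduce output,
    # terrible for an HBase write-ahead log that hflushes constantly.
    existing = store.get(key) if store.exists(key) else b""
    store.put(key, existing + more)

store = ToyObjectStore()
append(store, "wal", b"edit-1;")
append(store, "wal", b"edit-2;")
```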

EBS itself
is an interesting storage technology: block storage offered over the
Ethernet network, backed by an occasionally sync'd local disk elsewhere. EBS has
some storage resiliency built in, so the question of how many replicas to keep when
HDFS is built on top of it is very interesting.
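The replica-count question comes down to multiplication: each HDFS replica lands on its own EBS volume, and each volume is itself internally redundant. A back-of-the-envelope calculation, where the EBS-internal copy count is a guess (Amazon doesn't publish the actual figure):

```python
# Back-of-the-envelope physical storage footprint of HDFS-on-EBS.
# EBS_INTERNAL_COPIES is an assumption -- Amazon does not publish the
# real number; 2 is a plausible guess for a mirrored volume.

EBS_INTERNAL_COPIES = 2

def physical_copies(hdfs_replication):
    # Each HDFS replica sits on a separate EBS volume, and each
    # volume is itself internally replicated.
    return hdfs_replication * EBS_INTERNAL_COPIES

default = physical_copies(3)   # stock dfs.replication=3
reduced = physical_copies(2)   # one option when the volume is already resilient
```

Under this (assumed) model, the stock setting keeps six physical copies of every block, which is part of why the replica count on EBS is worth rethinking.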


This problem of offering cost-effective Hadoop as an on-demand, self-service
offering in the cloud is very interesting. This is a nut I want to
crack....


Summary: I'm not sure that HDFS is the right FS in this world, as it contains a lot of assumptions about system stability and HDD persistence that aren't valid any more. With the ability to plug in new block placement policies, you could do tricks like ensuring one replica lives in a persistent blockstore (and relying on it always being there), and adding other replicas in transient storage if the data is about to be needed in jobs.
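The placement trick described above could be sketched like this. The pool names and the chooser function are illustrative only; this is not HDFS's actual pluggable BlockPlacementPolicy interface.

```python
# Illustrative placement: the first replica is pinned to persistent
# storage (durability), and extra replicas go to transient/ephemeral
# storage (locality for upcoming jobs). NOT the real HDFS
# BlockPlacementPolicy API -- just the idea.

def choose_replicas(persistent_nodes, transient_nodes, n_replicas):
    if not persistent_nodes:
        raise ValueError("need at least one persistent node for durability")
    # Replica 1: durability -- always lands on a persistent block store,
    # so the data survives even if every transient node vanishes.
    chosen = [persistent_nodes[0]]
    # Replicas 2..n: bandwidth/locality for jobs; losing them loses no
    # data, so cheap transient storage is acceptable.
    chosen.extend(transient_nodes[: n_replicas - 1])
    return chosen

placement = choose_replicas(["ebs-1"], ["local-1", "local-2"], 3)
```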
