Thanks for your input, Milind. It's very useful and interesting. In the interest of brevity, I have truncated most of it except for the point regarding 'cloud friendly'. I have done some research into this, and want to get some more community feedback.
>2. Built in support for the cloud. (Whirr is interesting. Ambari more so, > >but both fall short.) > > Not very sure. If by "support for the cloud" means ability to provision > atop a hypervisor, adding or removing instances etc, I think there are > other approaches proven in the industry. > > There are two aspects to cloud friendliness - deployment technologies/automation, and storage. As far as deployment automation is concerned, I am eager to know what other approaches you are familiar with. Chef/Puppet et. al. are not interesting to me. I want this to have end user self-serve service characteristics, not 'end users file ticket, sysadmin runs [chef|puppet|other] script'. Storage is very interesting. My own thoughts, from analyzing EC2 and EMR are as follows. (A lot of the following is speculation and educated guesswork, so I may be totally off, but here it is anyway): Amazon's philosophy is totally 'on-demand bring up when needed and tear down when done'. I like this philosophy a lot. However it does not work well for storage. Storage needs to be always up and available. Hence, they took Hadoop, stripped off HDFS and built a shim to S3, their object storage service. There is no posix there. Map Reduce jobs run in VMs that are brought up on demand, and access the S3 hosted files using the protocol s3n (n stands for native - that's native to s3 not native to Hadoop). When this turned out to be slow as sh**, they seem to have hacked the HDFS layer some more, in order to actually have a NameNode for metadata, but to use S3 for storing blocks. They have a protocol s3 to access this. Both of these approaches have one severe failing - they do not support Append and Hflush. ergo - no HBase on EMR. I am sure they are working furiously to address this shortcoming and add append/hflush support to s3n or s3, in order to make it possible to run HBase on EMR. In the meantime, anecdotal evidence suggests that at least half of Amazon's customers are opting to use Apache Hadoop on EC2 VMs with EBS storage (completely bypassing the EMR offering). EBS itself is an interesting storage technology. It is block storage offered over the ethernet network, from an occassionally sync'd local disk elsewhere. EBS has some storage resiliency built in, so the question of how many replicas when HDFS is built on top of this is very interesting. This problem of offering a cost effective Hadoop as an on-demand self service offering in the cloud is very interesting. This is a nut I want to crack.... Sorry about the long rant, and again, it is all in the hope that I can evoke some postings from people who know more about this than I do. Thanks, Jagane
