> If you buy the argument that EBS is resilient storage

Just for the record, data has been lost in EBS.
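A rough sketch of why that matters for the cost argument, using only the two prices quoted in this thread (ten cents a GB-month for EBS, fifteen for S3). The function names and the 1000 GB data size are made up for illustration, and the sketch ignores I/O request charges and anything else not mentioned in the thread:

```python
# Prices in US cents per GB-month, as quoted in this thread (2011 figures).
# Integer cents are used to avoid floating-point rounding in the comparison.
EBS_CENTS_PER_GB_MONTH = 10
S3_CENTS_PER_GB_MONTH = 15


def hdfs_on_ebs_cents(data_gb: int, replicas: int) -> int:
    """Monthly storage cost of HDFS on EBS; every HDFS replica occupies EBS space."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return data_gb * replicas * EBS_CENTS_PER_GB_MONTH


def s3_cents(data_gb: int) -> int:
    """Monthly storage cost of keeping the same data in S3."""
    return data_gb * S3_CENTS_PER_GB_MONTH


def cheaper_on_ebs(data_gb: int, replicas: int) -> bool:
    """True if HDFS on EBS costs less than S3 at the quoted prices."""
    return hdfs_on_ebs_cents(data_gb, replicas) < s3_cents(data_gb)


if __name__ == "__main__":
    data_gb = 1000  # hypothetical data set size
    for replicas in (1, 2, 3):
        ebs = hdfs_on_ebs_cents(data_gb, replicas) / 100
        s3 = s3_cents(data_gb) / 100
        print(f"replicas={replicas}: EBS ${ebs:.2f} vs S3 ${s3:.2f}")
```

With a single replica EBS comes out cheaper at these prices. As soon as you keep two replicas to guard against a lost volume, it costs more than S3.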
On 6 October 2011 14:51, Jagane Sundar <[email protected]> wrote:
> Note that I have changed the subject to be more relevant.
>
>> I've just started the wiki page on this topic:
>> http://wiki.apache.org/hadoop/Virtual%20Hadoop
>
> I will take a look at this wiki, and hopefully, contribute to it, Steve.
>
>>> There are two aspects to cloud friendliness - deployment
>>> technologies/automation, and storage.
>>
>> -agility to handle the failure modes of cloud infrastructure
>
> Good point. Amazon EMR starts billing for your EMR job when 90% of the
> compute VMs have fired up. They, too, seem to acknowledge the possibility of
> failure.
>
>> -security in a shared infrastructure
>
> Security is a valid concern for the public cloud. An internal homebrew
> OpenStack-based cloud may not need to worry as much about security.
> That said, a networking construct such as Amazon VPC goes a long way towards
> isolating the Hadoop.
>
>> -flexibility based on demand
>
> Right.
>
>>> As far as deployment automation is concerned, I am eager to know what
>>> other approaches you are familiar with. Chef/Puppet et al. are not
>>> interesting to me. I want this to have end user self-serve service
>>> characteristics, not
>>> 'end users file ticket, sysadmin runs [chef|puppet|other] script'.
>>
>> done this with a web UI: ask for the # of machines, bring up NN/JT/single DN
>> master node, once that is up bring up the workers with a config that
>> includes the hostname of the master node.
>
> A person with a database background who wants to use HBase for his Big Data
> processing will find the whole NN/JT/ZK etc. overwhelming. Much of this
> can be hidden. I think there is much work to be done in making Hadoop easier
> to use.
>
>>> that at least half of Amazon's customers are opting to use Apache Hadoop on
>>> EC2 VMs with EBS storage (completely bypassing the EMR offering).
>>
>> More expensive, but more flexible in terms of what you can run
>
> Not clear that EBS is more expensive. If you buy the argument that EBS is
> resilient storage, and one HDFS replica is adequate, then it turns out to be
> ten cents a GB-month, versus fifteen cents a GB-month for S3.
>
>> Summary: I'm not sure that HDFS is the right FS in this world, as it
>> contains a lot of assumptions about system stability and HDD persistence
>> that aren't valid any more. With the ability to plug in new placers you
>> could do tricks like ensure 1 replica lives in a persistent blockstore (and
>> rely on it always being there), and add other replicas in transient storage
>> if the data is about to be needed in jobs.
>
> I would be loath to use anything other than an official Apache Hadoop and
> its HDFS. My estimate is that various companies are going to pour in about
> 200 million dollars to develop Apache Hadoop. That kind of money brings in
> very, very smart engineers. To benefit from that ecosystem, stick with an
> Apache Hadoop from the community. As a counterpoint, witness the quandary
> Amazon is in. They are unable to react fast enough to the rise in popularity
> of HBase because they chose to go with their own file system alternative to
> HDFS.
>
> Thanks,
> Jagane
