Re: Hadoop as a Big Data app for the cloud

Daniel Sikar Thu, 06 Oct 2011 11:21:56 -0700

> If you buy the argument that EBS is resilient storage

Just for the record, data has been lost in EBS.


On 6 October 2011 14:51, Jagane Sundar <[email protected]> wrote:
> Note that I have changed the subject to be more relevant.
>
>
>>
>> I've just started the wiki page on this topic:
>> http://wiki.apache.org/hadoop/Virtual%20Hadoop
>>
> I will take a look at this wiki and, hopefully, contribute to it, Steve.
>
>
>>
>>
>>  There are two aspects to cloud friendliness - deployment
>>> technologies/automation, and storage.
>>>
>>
>> -agility to handle the failure modes of cloud infrastructure
>>
>
> Good point. Amazon EMR starts billing for your EMR job once 90% of the
> compute VMs have fired up. They, too, seem to acknowledge the possibility
> of failure.
>
> -security in a shared infrastructure
>>
>
> Security is a valid concern for the public cloud. An internal homebrew
> OpenStack-based cloud may not need to worry as much about security.
> That said, a networking construct such as Amazon VPC goes a long way towards
> isolating the Hadoop cluster.
>
> -flexibility based on demand
>>
>> Right.
>
>
>>
>>  As far as deployment automation is concerned, I am eager to know what
>>> other approaches you are familiar with. Chef/Puppet et al. are not
>>> interesting to me. I want this to have end-user self-serve service
>>> characteristics, not 'end users file ticket, sysadmin runs
>>> [chef|puppet|other] script'.
>>>
>>
>> I've done this with a web UI: ask for the number of machines, bring up a
>> NN/JT/single-DN master node, and once that is up, bring up the workers with
>> a config that includes the hostname of the master node.
>>
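
The bring-up order described above (master first, then workers configured with the master's hostname) can be sketched roughly as follows. This is only an illustration under stated assumptions: `launch_vm` and `wait_until_up` are hypothetical callables standing in for whatever the cloud provider offers, and ports 8020/8021 are assumed values. The property names `fs.default.name` and `mapred.job.tracker` are the ones used by Hadoop releases of this era.

```python
def worker_config(master_hostname, nn_port=8020, jt_port=8021):
    """Build the settings a worker needs to find the master.

    The port numbers are assumptions for this sketch, not mandated values.
    """
    return {
        "fs.default.name": f"hdfs://{master_hostname}:{nn_port}",
        "mapred.job.tracker": f"{master_hostname}:{jt_port}",
    }


def bring_up_cluster(num_workers, launch_vm, wait_until_up):
    """Start the master, wait for it, then start the workers.

    launch_vm(role, config) -> hostname and wait_until_up(hostname) are
    hypothetical hooks supplied by the caller; they are not a real API.
    """
    if num_workers < 1:
        raise ValueError("need at least one worker")
    # The master runs the NameNode and JobTracker (plus a single DataNode).
    master = launch_vm(role="master", config={})
    wait_until_up(master)
    # Workers are only launched once the master's hostname is known.
    config = worker_config(master)
    workers = [launch_vm(role="worker", config=config)
               for _ in range(num_workers)]
    return master, workers
```

A self-serve web UI would only need to collect `num_workers` from the user and call `bring_up_cluster`; everything else stays hidden.
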
>>
> A person with a database background who wants to use HBase for his Big Data
> processing will find the whole NN/JT/ZK etc. overwhelming. Much of this
> can be hidden. I think there is much work to be done in making Hadoop easier
> to use.
>
>
>
>> that at least half of Amazon's customers are opting to use Apache Hadoop on
>>> EC2 VMs with EBS storage (completely bypassing the EMR offering).
>>>
>>
>> More expensive, but more flexible in terms of what you can run
>>
>>
>>
> Not clear that EBS is more expensive. If you buy the argument that EBS is
> resilient storage, and one HDFS replica is adequate, then it turns out to be
> ten cents a GB-month, versus fifteen cents a GB-month for S3.
>
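
To make the arithmetic in the quoted paragraph concrete, here is a small sketch using only the two prices quoted there (ten cents per GB-month for EBS, fifteen for S3). It ignores any request or transfer charges, which is an assumption of this sketch, and the replica count is a parameter precisely because whether one HDFS replica is adequate is the point in dispute.

```python
# Prices as quoted in the thread, in whole cents to avoid float rounding.
EBS_CENTS_PER_GB_MONTH = 10
S3_CENTS_PER_GB_MONTH = 15


def ebs_monthly_cents(logical_gb, hdfs_replicas):
    """Storage cost of HDFS on EBS: every replica occupies its own EBS space."""
    return logical_gb * hdfs_replicas * EBS_CENTS_PER_GB_MONTH


def s3_monthly_cents(logical_gb):
    """Storage cost of keeping the same logical data in S3."""
    return logical_gb * S3_CENTS_PER_GB_MONTH


if __name__ == "__main__":
    data_gb = 1000
    for replicas in (1, 2, 3):
        print(replicas, ebs_monthly_cents(data_gb, replicas),
              s3_monthly_cents(data_gb))
```

With these prices, EBS is cheaper only at one replica; at two or more replicas the EBS figure exceeds the S3 figure.
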
>>
>> Summary: I'm not sure that HDFS is the right FS in this world, as it
>> contains a lot of assumptions about system stability and HDD persistence
>> that aren't valid any more. With the ability to plug in new placers you
>> could do tricks like ensure 1 replica lives in a persistent blockstore (and
>> rely on it always being there), and add other replicas in transient storage
>> if the data is about to be needed in jobs.
>>
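
The placement idea in the quoted summary (one replica always in a persistent blockstore, extra replicas in transient storage only when jobs are about to need the data) could look something like the sketch below. It is a toy illustration, not the HDFS block placement API: the function name, the node lists and the `needed_soon` flag are all hypothetical.

```python
def place_replicas(persistent_nodes, transient_nodes, needed_soon,
                   extra_replicas=2):
    """Choose storage locations for one block.

    Always keeps exactly one replica on persistent storage; adds up to
    extra_replicas transient copies only if the data is about to be used.
    All names here are hypothetical and exist only for this sketch.
    """
    if not persistent_nodes:
        raise ValueError("at least one persistent node is required")
    # The replica that is relied on to always be there.
    placement = [persistent_nodes[0]]
    if needed_soon:
        # Transient copies exist only to serve the upcoming jobs.
        placement.extend(transient_nodes[:extra_replicas])
    return placement
```

Cold data then costs one persistent copy, while hot data temporarily gains transient replicas near the compute.
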
>
> I would be loath to use anything other than an official Apache Hadoop and
> its HDFS. My estimate is that various companies are going to pour in about
> 200 million dollars to develop Apache Hadoop. That kind of money brings in
> very smart engineers. To benefit from that ecosystem, stick with an
> Apache Hadoop from the community. As a counterpoint, witness the quandary
> Amazon is in. They are unable to react fast enough to the rise in popularity
> of HBase because they chose to go with their own file system alternative to
> HDFS.
>
> Thanks,
> Jagane
>
