2010/9/6 褚 鵬兵 <chu_pengb...@hotmail.com>:
>
> Hi, my Hadoop friends. I have three questions about Hadoop:
>
> 1. Speed between the datanodes. With terabytes of data on one datanode,
> data has to transfer from one datanode to another. If that transfer speed
> is poor, I think Hadoop will be slow. I have heard of the gNet
> architecture in Greenplum; what about Hadoop? Is SAS storage plus gigabit
> Ethernet the best answer?
>
> 2. The GUI tool. There is a Hive web tool in Hadoop, but it is too
> simple for our business work. If Hadoop + Hive is used as a DWH, then
> how should users access it: via the command-line tools, or via a newly
> developed web GUI tool?
>
> 3. A 5-computer Hadoop cluster versus 1 computer running SQL Server 2000.
> The 5 Hadoop computers are Celeron 2.66 GHz with 1 GB memory on Ethernet
> (namenode + secondarynamenode + 3 datanodes); the SQL Server 2000 machine
> is also a Celeron 2.66 GHz with 1 GB memory. I ran the same select
> operation on the same 100 MB of data: the 5-computer Hadoop cluster took
> 2 min 30 s, while the single SQL Server 2000 machine took 2 min 25 s. So
> the 5-computer Hadoop cluster is not better. Why? Can anyone give me some
> advice?
>
> Thanks in advance.
>
Why use Hadoop in preference to a database? At the recent Hadoop User Group (UK) meeting, Andy Kemp from http://www.forward.co.uk/ presented their experience in moving from a MySQL database approach to Hadoop. From my notes of his talk, their system manages 120 million keywords and is updated at a rate of 20 GB/day. They originally used a sharded MySQL database but found it couldn't scale to handle the types of queries their users required, e.g. "Can you cluster 17(?) million keyword phrases into thematic groups?". Their calculations indicated that the database approach would take more than a year to handle such a query. Moving to a cluster of 100 Hadoop nodes on Amazon EC2 reduced this time to 7 hours. The issues then became the cost of storage and of moving the data to and from the cluster. They next moved to a private VM system with about 30 VMs - I assume the processing took about the same time, as I didn't note this down. From there they moved to dedicated hardware, 5 dedicated Hadoop nodes, and achieved better performance than with the 30 VMs. Andy's talk, "Hadoop in Context", should be available as a podcast at http://skillsmatter.com/podcast/cloud-grid/hadoop-in-context and would be well worth watching, but when I last looked it hadn't been uploaded yet.

At the same event, Ian Broadhead from http://www.playfish.com/ gave a talk on managing the activity of over 1 million active Internet gamers producing over 50 GB of data a day. Their original MySQL system took up to 50 times longer to process their data load than an EC2 cluster of Hadoop nodes. He talked about a typical workload being reduced from 2-3 days (using MySQL) down to 6 hours (using Hadoop). Unfortunately I don't think Ian's talk will appear as a podcast.
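The keyword-grouping workload Andy described maps naturally onto the MapReduce model. As a very rough sketch (this is not their actual job - the "theme" function below is a made-up stand-in, and the shuffle/reduce step is simulated locally in Python rather than run on a cluster), the map step could emit a (theme, phrase) pair for each keyword phrase, and the reduce step could collect the phrases belonging to each theme:

```python
from collections import defaultdict

def map_phrase(phrase):
    # Hypothetical stand-in for real theme extraction: use the first
    # word of the phrase as its "theme". A production job would use
    # something far more sophisticated.
    theme = phrase.split()[0].lower()
    return (theme, phrase)

def reduce_groups(pairs):
    # MapReduce's shuffle phase delivers all values sharing a key to
    # the same reducer; grouping into a dict mimics that locally.
    groups = defaultdict(list)
    for theme, phrase in pairs:
        groups[theme].append(phrase)
    return dict(groups)

phrases = [
    "cheap flights to paris",
    "cheap hotels in rome",
    "best flights to tokyo",
]
grouped = reduce_groups(map_phrase(p) for p in phrases)
```

The point of the exercise is that both steps are trivially parallel: the map calls are independent, and each theme's reduce is independent, which is why adding nodes cuts the wall-clock time the way Andy reported.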
However, most presentations during the evening made the point that Hadoop didn't completely replace their databases; it just provided a convenient way to rapidly process large volumes of data, with the output from the Hadoop processing typically being stored in databases to satisfy general everyday business queries. I think the common theme here was that all of these users had large datasets on the order of hundreds of GBs, with multiple views of that data, handling on the order of tens of millions of updates a day.

I hope that helps.

Chris
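That "Hadoop for the heavy lifting, a database for the everyday queries" pattern is easy to sketch. Assuming a Hadoop job has already reduced a large volume of raw event data down to small per-keyword daily totals (the table name and sample rows below are invented for illustration, and SQLite stands in for whatever database is actually used), the aggregated output is simply bulk-loaded and then queried with ordinary SQL:

```python
import sqlite3

# Pretend this small list is the output of a Hadoop job that
# aggregated raw logs down to per-keyword daily hit counts.
hadoop_output = [
    ("cheap flights", "2010-09-06", 12840),
    ("cheap hotels", "2010-09-06", 9310),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE keyword_daily (keyword TEXT, day TEXT, hits INTEGER)"
)
conn.executemany(
    "INSERT INTO keyword_daily VALUES (?, ?, ?)", hadoop_output
)

# Everyday business queries now hit the database, not the cluster.
row = conn.execute(
    "SELECT keyword, hits FROM keyword_daily "
    "WHERE day = '2010-09-06' ORDER BY hits DESC LIMIT 1"
).fetchone()
```

The batch system only runs when the expensive aggregation is needed, while the database serves the frequent, small, interactive queries it is good at.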