2010/9/6 褚 鵬兵 <chu_pengb...@hotmail.com>:
>
> Hi, my Hadoop friends. I have three questions about Hadoop:
>
> 1. Speed between the datanodes. With terabytes of data on one datanode,
> data has to transfer from one datanode to another. If that transfer speed
> is poor, I think Hadoop will be slow. I have heard of the gNet
> architecture in Greenplum; what about Hadoop? Is SAS storage plus gigabit
> Ethernet the best answer?
>
> 2. The GUI tool. There is a Hive web tool in Hadoop, but it is too
> simple for our business work. If Hadoop + Hive is used as a DWH, then
> how should users access it: via the command-line tools, or via a newly
> developed web GUI tool?
>
> 3. A 5-computer Hadoop cluster versus 1 computer running SQL Server 2000.
> The 5 Hadoop computers are Celeron 2.66 GHz with 1 GB memory on Ethernet
> (namenode + secondarynamenode + 3 datanodes); the SQL Server 2000 machine
> is also a Celeron 2.66 GHz with 1 GB memory. I ran the same select
> operation on the same 100 MB of data: the 5-computer Hadoop cluster took
> 2 min 30 s, while the single SQL Server 2000 machine took 2 min 25 s. So
> the 5-computer Hadoop cluster is not better. Why? Can anyone give me some
> advice?
>
> Thanks in advance.
>
Why use Hadoop in preference to a database? At the recent Hadoop User Group (UK) meeting, Andy Kemp from http://www.forward.co.uk/ presented their experience in moving from a MySQL database approach to Hadoop. From my notes of his talk, their system manages 120 million keywords and is updated at a rate of 20 GB/day. They originally used a sharded MySQL database but found it couldn't scale to handle the types of queries their users required, e.g. "Can you cluster 17(?) million keyword phrases into thematic groups?". Their calculations indicated that the database approach would take more than a year to handle such a query. Moving to a cluster of 100 Hadoop nodes on Amazon EC2 reduced this time to 7 hours. The issues then became the cost of storage and of moving the data to and from the cluster. They next moved to a private VM system with about 30 VMs - I assume the processing took about the same time, as I didn't note this down. From there they moved to dedicated hardware, 5 dedicated Hadoop nodes, and achieved better performance than with the 30 VMs. Andy's talk, "Hadoop in Context", should be available as a podcast at http://skillsmatter.com/podcast/cloud-grid/hadoop-in-context and would be well worth watching, but when I last looked it hadn't been uploaded yet.

At the same event, Ian Broadhead from http://www.playfish.com/ gave a talk on managing the activity of over 1 million active Internet gamers producing over 50 GB of data a day. Their original MySQL system took up to 50 times longer to process their data load than an EC2 cluster of Hadoop nodes. He talked about a typical workload being reduced from 2-3 days (using MySQL) down to 6 hours (using Hadoop). Unfortunately I don't think Ian's talk will appear as a podcast.
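The keyword-grouping workload Andy described maps naturally onto the MapReduce model. As a very rough sketch (this is not their actual job - the "theme" function below is a made-up stand-in, and the shuffle/reduce step is simulated locally in Python rather than run on a cluster), the map step could emit a (theme, phrase) pair for each keyword phrase, and the reduce step could collect the phrases belonging to each theme:

```python
from collections import defaultdict

def map_phrase(phrase):
    # Hypothetical stand-in for real theme extraction: use the first
    # word of the phrase as its "theme". A production job would use
    # something far more sophisticated.
    theme = phrase.split()[0].lower()
    return (theme, phrase)

def reduce_groups(pairs):
    # MapReduce's shuffle phase delivers all values sharing a key to
    # the same reducer; grouping into a dict mimics that locally.
    groups = defaultdict(list)
    for theme, phrase in pairs:
        groups[theme].append(phrase)
    return dict(groups)

phrases = [
    "cheap flights to paris",
    "cheap hotels in rome",
    "best flights to tokyo",
]
grouped = reduce_groups(map_phrase(p) for p in phrases)
```

The point of the exercise is that both steps are trivially parallel: the map calls are independent, and each theme's reduce is independent, which is why adding nodes cuts the wall-clock time the way Andy reported.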
However, most presentations during the evening made the point that Hadoop didn't completely replace their databases; it just provided a convenient way to rapidly process large volumes of data, with the output from the Hadoop processing typically being stored in databases to satisfy general everyday business queries. I think the common theme here was that all of these users had large datasets on the order of hundreds of GBs, with multiple views of that data, handling on the order of tens of millions of updates a day.

I hope that helps.

Chris
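That "Hadoop for the heavy lifting, a database for the everyday queries" pattern is easy to sketch. Assuming a Hadoop job has already reduced a large volume of raw event data down to small per-keyword daily totals (the table name and sample rows below are invented for illustration, and SQLite stands in for whatever database is actually used), the aggregated output is simply bulk-loaded and then queried with ordinary SQL:

```python
import sqlite3

# Pretend this small list is the output of a Hadoop job that
# aggregated raw logs down to per-keyword daily hit counts.
hadoop_output = [
    ("cheap flights", "2010-09-06", 12840),
    ("cheap hotels", "2010-09-06", 9310),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE keyword_daily (keyword TEXT, day TEXT, hits INTEGER)"
)
conn.executemany(
    "INSERT INTO keyword_daily VALUES (?, ?, ?)", hadoop_output
)

# Everyday business queries now hit the database, not the cluster.
row = conn.execute(
    "SELECT keyword, hits FROM keyword_daily "
    "WHERE day = '2010-09-06' ORDER BY hits DESC LIMIT 1"
).fetchone()
```

The batch system only runs when the expensive aggregation is needed, while the database serves the frequent, small, interactive queries it is good at.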