Do you want random access for web presentation? What is your required update time? What about search index delay?
Or batch sequential access for large-scale computation like PageRank? These are very different questions with very different answers.

The first is likely to be a standard sharded profile database with an associated real-time Lucene system. See Voldemort. See the recent real-time indexing tricks from LinkedIn and IBM. See MogileFS. You are talking about a few TB of storage, so you should be able to get away with a small cluster of <10 ordinary machines. Memcache might or might not be important here. Recent memcache appliance hardware is very, very impressive, integrating SSD and memory into a single 1U box for nearly a TB of memcache-speed storage. Filling a front-end cache like this from MR is pretty easy.

The second is likely a very simple Hadoop cluster. Depending on what you need to compute, you should be able to be happy with a moderate-sized cluster for many purposes. If you want to run a full index on everything (not generally good practice), then a few dozen machines might be necessary. Detailed tests would be a good idea.

On Fri, Jul 3, 2009 at 12:28 AM, Marcus Herou <[email protected]> wrote:

> Now I want to ask you a question: What hardware would you use for storing
> 10-100 million blogs and 1-10 billion blog entries and make each findable
> within, let's say, 100 msec? I am curious, since it all comes down to
> money: money to invest in the right hardware to get the most bang for
> the buck.
>
> What is mostly needed for HBase to scale?
> Memory?
> Total amount of HDFS IO?
> CPU?
> Too little memory and I guess the load goes IO-bound?

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
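To illustrate the kind of sharded lookup a profile database like the one described would do, here is a minimal Python sketch of stable hash-based shard routing. The shard count and key format are hypothetical, not from the thread; real systems (e.g. Voldemort) use consistent hashing to allow rebalancing:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical: a small cluster of <10 ordinary machines

def shard_for(blog_id: str) -> int:
    """Map a blog id to a shard index with a stable hash.

    hashlib.md5 is used instead of Python's built-in hash(), which is
    randomized per process; the mapping must survive restarts so that
    lookups for the same blog always hit the same machine.
    """
    digest = hashlib.md5(blog_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All lookups for the same id route to the same shard:
assert shard_for("blog-12345") == shard_for("blog-12345")
```

Simple modulo sharding is enough for a fixed cluster; adding or removing machines remaps most keys, which is why consistent hashing is usually preferred once the cluster size changes.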

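The batch PageRank computation mentioned above would run as a MapReduce job on the Hadoop cluster in practice; as a sketch of what each iteration computes, here is a single-machine power-iteration version in Python (toy graph, illustrative only):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency dict
    {node: [outbound neighbors]}.

    A toy stand-in for the batch job a Hadoop cluster would run:
    each iteration is one MapReduce pass (map: emit rank/out-degree
    to each neighbor; reduce: sum contributions per node).
    """
    nodes = set(links) | {n for outs in links.values() for n in outs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outs in links.items():
            if outs:
                share = damping * rank[node] / len(outs)
                for out in outs:
                    new_rank[out] += share
            else:
                # Dangling node: spread its rank evenly so mass is conserved.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
print(sum(ranks.values()))  # total rank mass stays at 1.0
```

The per-iteration shuffle of rank contributions is exactly the part that makes this a batch, sequential-access workload rather than a random-access one, which is why it belongs on the Hadoop cluster and not on the serving database.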