Do you want random access for web presentation? What is your required update time? What about search index delay?
Or batch sequential access for large-scale computation like PageRank? These are very different questions with very different answers.

The first is likely to be a standard sharded profile database with an associated real-time Lucene system. See Voldemort. See the recent real-time indexing tricks from LinkedIn and IBM. See MogileFS. You are talking about a few TB of storage, so you should be able to get away with a small cluster of <10 ordinary machines. Memcache might or might not be important here. Recent memcache appliance hardware is very, very impressive, integrating SSD and memory into a single 1U box for nearly a TB of memcache-speed storage. Filling a front-end cache like this from MR is pretty easy.

The second is likely a very simple Hadoop cluster. Depending on what you need to compute, you should be able to be happy with a moderate-sized cluster for many purposes. If you want to run a full index on everything (not generally good practice), then a few dozen machines might be necessary. Detailed tests would be a good idea.

On Fri, Jul 3, 2009 at 12:28 AM, Marcus Herou <[email protected]> wrote:

> Now I want to ask you a question: What hardware would you use for storing
> 10-100 million blogs and 1-10 billion blog entries and make each findable
> within, let's say, 100 msec? I am curious, since it all comes down to
> money: money to invest in the right hardware to get the most bang for
> the buck.
>
> What is mostly needed for HBase to scale?
> Memory?
> Total amount of HDFS IO?
> CPU?
> Too little memory and I guess the load goes IO-bound?

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
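To illustrate the kind of sharded lookup a profile database like the one described would do, here is a minimal Python sketch of stable hash-based shard routing. The shard count and key format are hypothetical, not from the thread; real systems (e.g. Voldemort) use consistent hashing to allow rebalancing:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical: a small cluster of <10 ordinary machines

def shard_for(blog_id: str) -> int:
    """Map a blog id to a shard index with a stable hash.

    hashlib.md5 is used instead of Python's built-in hash(), which is
    randomized per process; the mapping must survive restarts so that
    lookups for the same blog always hit the same machine.
    """
    digest = hashlib.md5(blog_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All lookups for the same id route to the same shard:
assert shard_for("blog-12345") == shard_for("blog-12345")
```

Simple modulo sharding is enough for a fixed cluster; adding or removing machines remaps most keys, which is why consistent hashing is usually preferred once the cluster size changes.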

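The batch PageRank computation mentioned above would run as a MapReduce job on the Hadoop cluster in practice; as a sketch of what each iteration computes, here is a single-machine power-iteration version in Python (toy graph, illustrative only):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency dict
    {node: [outbound neighbors]}.

    A toy stand-in for the batch job a Hadoop cluster would run:
    each iteration is one MapReduce pass (map: emit rank/out-degree
    to each neighbor; reduce: sum contributions per node).
    """
    nodes = set(links) | {n for outs in links.values() for n in outs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outs in links.items():
            if outs:
                share = damping * rank[node] / len(outs)
                for out in outs:
                    new_rank[out] += share
            else:
                # Dangling node: spread its rank evenly so mass is conserved.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
print(sum(ranks.values()))  # total rank mass stays at 1.0
```

The per-iteration shuffle of rank contributions is exactly the part that makes this a batch, sequential-access workload rather than a random-access one, which is why it belongs on the Hadoop cluster and not on the serving database.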