We keep the WebDB on 2 computers (SATA RAID systems with 16 x 250GB drives),
and the segments on low-end PCs, each with 250 GB drives.  Our version of the
WebDB is different in that we split it up -- we have several WebDBs running,
partitioned by a hash function -- which we can do because we removed the link
analysis.

You're better off with more machines if speed is a concern, and fewer if rack
space/power is limited.
 


-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 06, 2006 11:08 AM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

Thanks so much for the details!  That makes a lot of sense.  How many
documents are you indexing?  I'm working with a ~200-300 million document
index, which translates to ~2-3 terabytes of Nutch 0.8 segments.  With 250GB
of space per Linux box, I'm thinking ~8 distributed machines would be
sufficient -- maybe 16 if the WebDB and future scaling are considered.  Or
only 8 machines with more disk space each (increasing from 250GB to 500GB).

From your experience, does this sound reasonable?  :-)

Thanks again!
DaveG

-----Original Message-----
From: Chirag Chaman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 9:10 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

I doubt you will see a difference with four 50GB segments (maybe an extra few
tens of ms). On the contrary, we feel that 50GB segments are faster because we
can merge/optimize each one on a cycle.  So we may have 10 unoptimized
segments after 30 days (as new ones have made the old documents obsolete), and
during the refetch the duplicates are eliminated -- this also eases
maintenance & merges.

The decision is up to you. We simply create them in buckets -- taking a hash
of the URL and ANDing it with (number of segments - 1), which is the same as
hash mod the segment count when that count is a power of two. You can come up
with your own scheme, but the idea is: from the ground up, keep small segments
and merge them as you grow; then, when they grow too big, eliminate them
completely (as newer, smaller segments contain pretty much the same data).

The above pretty much follows the map-reduce paradigm.
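
A rough sketch of that bucketing in Java (the class name, the choice of CRC32
as the URL hash, and the segment count here are just illustrative
assumptions, not our actual code):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of the hash-based bucketing described above. CRC32 is a stand-in
// for whatever URL hash you prefer; NUM_SEGMENTS is a hypothetical count.
public class SegmentBucketer {

    // Must be a power of two so that (hash & (NUM_SEGMENTS - 1))
    // is equivalent to (hash % NUM_SEGMENTS).
    private static final int NUM_SEGMENTS = 16;

    // Map a URL to a segment bucket in [0, NUM_SEGMENTS).
    public static int bucketFor(String url) {
        CRC32 crc = new CRC32();
        crc.update(url.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() & (NUM_SEGMENTS - 1));
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("http://www.example.com/index.html"));
    }
}

Note the mask trick only works when the segment count is a power of two;
otherwise use plain mod (and watch out for negative hash values).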

As a real-world example, we currently have numerous distributed servers
(numbering 2^n) -- each holding multiple segments of different sizes (the
segment counts also being powers of two). So, one such server contains 8
segments of approx. 40GB each, 8 segments of 5GB or so, and about 32 segments
under 50MB. Those 32 segments get merged every few hours into one segment.
Avg. search time on this server is about 160 ms.
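
To put that merge cycle in code terms, here is a sketch of the selection
policy only (the Segment class and the 50MB threshold are illustrative; the
actual merge is handed off to whatever segment-merge tool you use):

import java.util.ArrayList;
import java.util.List;

// Sketch of the periodic small-segment merge policy described above.
public class SmallSegmentMergePolicy {

    // Hypothetical view of an on-disk segment: a name and its size in bytes.
    public static class Segment {
        final String name;
        final long sizeBytes;
        Segment(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    // Segments below this size are folded together every few hours
    // (roughly the "under 50MB" tier mentioned above).
    private static final long SMALL_SEGMENT_BYTES = 50L * 1024 * 1024;

    // Pick every segment under the threshold; these get merged into one.
    public static List<Segment> selectForMerge(List<Segment> segments) {
        List<Segment> candidates = new ArrayList<Segment>();
        for (Segment s : segments) {
            if (s.sizeBytes < SMALL_SEGMENT_BYTES) {
                candidates.add(s);
            }
        }
        return candidates;
    }
}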

By breaking this down smaller and smaller based on the hash value, we know
exactly on which server and within which segment a document should be --
thus avoiding duplicates.
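
Extending the earlier sketch to two levels (the particular bit-split below is
just one way to do it, an assumption for illustration): the same URL hash
deterministically picks both the server and the segment on that server, so a
refetched page always lands in the same place instead of creating a duplicate
somewhere else.

// Two-level routing sketch: low bits of the URL hash pick the server,
// the next bits pick the segment on that server. Counts are powers of two.
public class DocumentRouter {

    private static final int NUM_SERVERS = 8;          // e.g. 2^3 servers
    private static final int SEGMENTS_PER_SERVER = 16; // e.g. 2^4 segments each

    public static int serverFor(long urlHash) {
        return (int) (urlHash & (NUM_SERVERS - 1));
    }

    public static int segmentFor(long urlHash) {
        // Skip the bits already used to pick the server, then mask.
        int serverBits = Integer.numberOfTrailingZeros(NUM_SERVERS);
        return (int) ((urlHash >>> serverBits) & (SEGMENTS_PER_SERVER - 1));
    }
}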


CC-



 

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 4:10 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

Thanks for the response.  Having a bunch of 50GB segments is more
manageable, but is it quick enough for user searches?

How do you decide what documents go into what segments?

Thanks again,
DaveG


-----Original Message-----
From: Chirag Chaman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 4:00 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce


While I like the overall idea, my *personal* feeling is that 200GB segments
become a problem if you want fast recovery after a failover.

We instead use multiple segments, each no more than 50GB or so, and do fast
recovery via a sync process and a check daemon.

NOTE: At the time we built our solution, NDFS was not production quality --
not sure where things stand now.
 

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 3:40 PM
To: [email protected]
Subject: Scaling Nutch 0.8 via Map/Reduce

Hi, in working with Map/Reduce in Nutch 0.8, I'd like to distribute segments
to multiple machines via NDFS.  Let's say I've got ~250GB of hard-drive
space per machine; to store terabytes of data, should I generate a bunch of
~200GB segments and push them out into NDFS?

 

How do I partition/organize these segments?  Randomly?  By URL or host?
The relevant use case is to randomly access a given URL or host -- or is this
accomplished via map/reduce?

 

Thanks for any insight or ideas!

 

DaveG








