Thanks so much for the details!  That makes a lot of sense.  How many
documents are you indexing?  I'm working with a ~200-300 million
document index, which translates to ~2-3 terabytes of Nutch 0.8
segments.  With 250GB of space per Linux box, I'm thinking ~8
distributed machines would be sufficient -- maybe 16 if webdb and
future scaling are considered.  Or just 8 machines with more disk
space (increasing from 250GB to 500GB per box).
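
For reference, the back-of-envelope arithmetic behind those numbers is
roughly the sketch below (illustrative only -- the class and figures
are my own estimates from above, not anything out of Nutch, and NDFS
replication isn't factored in):

// Illustrative sizing sketch only -- not Nutch code.  Figures are the
// rough estimates above (decimal GB/TB, no NDFS replication).
public class SegmentSizing {
    public static void main(String[] args) {
        double diskPerBoxGb = 250.0;               // usable space per Linux box
        for (double segmentsTb : new double[] { 2.0, 3.0 }) {
            double totalGb = segmentsTb * 1000.0;  // ~2-3 TB of Nutch 0.8 segments
            long machines = (long) Math.ceil(totalGb / diskPerBoxGb);
            System.out.println(segmentsTb + " TB at 250GB/box -> "
                    + machines + " machines");
        }
        // Headroom for webdb, indexes, and refetch growth is what pushes
        // the estimate from ~8 toward ~16 boxes (or 500GB disks on 8 boxes).
    }
}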

From your experience, does this sound reasonable?  :-)

Thanks again!
DaveG

-----Original Message-----
From: Chirag Chaman [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 04, 2006 9:10 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

I doubt you will see a difference with four 50GB segments (maybe an
extra few tens of milliseconds).  On the contrary, we feel that 50GB
segments are faster, as we can merge/optimize each one on a cycle.  So
we may have 10 unoptimized segments after 30 days (as new ones have
made the old documents obsolete), and during the refetch the dups are
eliminated -- this also eases maintenance and merging.

The decision is up to you.  We simply create them in buckets -- taking
a hash of the URL and ANDing it with (number of segments - 1), which is
just the hash modulo the segment count when that count is a power of
two.  You can come up with your own scheme, but the idea is to keep
small segments from the ground up, merge them as you grow, and then,
when they grow too big, eliminate them completely (as new, smaller
segments contain pretty much the same data).

The above pretty much follows the map-reduce paradigm.
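
For what it's worth, a minimal sketch of that bucketing might look like
the following (illustrative only, not Nutch code or ours; it assumes
the segment count is a power of two, so the AND and the mod agree):

// Illustrative bucketing sketch -- not Nutch code.  Assumes numSegments
// is a power of two, so (hash & (numSegments - 1)) equals (hash % numSegments).
import java.security.MessageDigest;

public class SegmentBuckets {
    public static int bucketFor(String url, int numSegments) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
        // Fold the first four digest bytes into an int, then drop the sign bit.
        int hash = ((digest[0] & 0xff) << 24) | ((digest[1] & 0xff) << 16)
                 | ((digest[2] & 0xff) << 8)  |  (digest[3] & 0xff);
        return (hash & 0x7fffffff) & (numSegments - 1);
    }

    public static void main(String[] args) throws Exception {
        // A refetched URL always hashes to the same bucket, so the newer
        // copy supersedes the old one when that bucket's segments merge.
        System.out.println(bucketFor("http://example.com/page.html", 16));
    }
}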

As a real-world example, we currently have numerous distributed
servers (numbering 2^n) -- each having multiple segments of different
sizes (with segment counts also conforming to 2^n).  So, one such
server contains 8 segments of approx. 40GB each, 8 segments of 5GB or
so, and about 32 segments under 50MB.  These 32 segments get merged
every few hours into one segment.  Average search time on this server
is about 160 ms.

By breaking this down smaller and smaller based on the hash value, we
know exactly on which server and within which segment a document
should be -- thus avoiding duplicates.
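
Roughly speaking, the routing boils down to something like this
hand-wavy sketch (the class and numbers are made up, not our
production code):

// Hand-wavy routing sketch.  With 2^n servers and power-of-two segment
// counts per server, one URL hash pins each document to exactly one
// (server, segment) pair, so duplicates have nowhere to hide.
public class UrlRouter {
    private final int numServers;         // number of search servers (2^n)
    private final int segmentsPerServer;  // segments per server (also 2^n)

    public UrlRouter(int numServers, int segmentsPerServer) {
        this.numServers = numServers;
        this.segmentsPerServer = segmentsPerServer;
    }

    public int serverFor(int urlHash) {
        return urlHash & (numServers - 1);                  // low bits pick the box
    }

    public int segmentFor(int urlHash) {
        return (urlHash >>> 16) & (segmentsPerServer - 1);  // higher bits pick the segment
    }

    public static void main(String[] args) {
        UrlRouter router = new UrlRouter(8, 64);
        int h = "http://example.com/".hashCode();
        System.out.println("server " + router.serverFor(h)
                + ", segment " + router.segmentFor(h));
    }
}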


CC-



 

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 04, 2006 4:10 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

Thanks for the response.  Having a bunch of 50GB segments is more
manageable, but is it quick enough for user searches?

How do you decide what documents go into what segments?

Thanks again,
DaveG


-----Original Message-----
From: Chirag Chaman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 4:00 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce


While I like the overall idea, my *personal* feeling is that 200GB
segments are more of a problem if you want fast recovery after a
failover.

We instead use multiple segments, each no more than 50GB or so, and do
fast recovery via a sync process and a check daemon.
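
The check daemon amounts to something like the sketch below (the host,
paths, and interval are made-up placeholders, not our actual tooling;
recovery here is just a plain rsync from a peer copy of the segment):

// Rough sketch of a segment check daemon -- placeholder host and paths.
import java.io.File;

public class SegmentCheckDaemon {
    public static void main(String[] args) throws Exception {
        File localSegment = new File("/data/segments/seg-0001");  // hypothetical path
        String peerCopy = "peer1:/data/segments/seg-0001/";       // hypothetical peer

        while (true) {
            if (!localSegment.isDirectory()) {
                // Segment directory is gone: pull it back from the peer
                // and let the searcher reopen it.
                new ProcessBuilder("rsync", "-a", peerCopy,
                        localSegment.getAbsolutePath() + "/")
                        .inheritIO().start().waitFor();
            }
            Thread.sleep(60 * 1000L);  // re-check once a minute
        }
    }
}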

NOTE: At the time we built our solution, NDFS was not production
quality -- not sure where things stand now.
 

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 04, 2006 3:40 PM
To: [email protected]
Subject: Scaling Nutch 0.8 via Map/Reduce

Hi, in working with Map/Reduce in Nutch 0.8, I'd like to distribute
segments
to multiple machines via NDFS.  Let's say I've got ~250GB of hard-drive
space per machine; to store terabytes of data, should I generate a bunch
of
~200GB segments and push them out into NDFS?

 

How do I partition/organize these segments?  Randomly?  By URL or
host?  The relevant use case is to randomly access a given URL or
host -- or is this accomplished via map/reduce?

 

Thanks for any insight or ideas!

 

DaveG






