I doubt you will see a difference with four 50GB segments (maybe a few
extra tens of ms). On the contrary, we feel that 50GB segments are faster, as
we can merge/optimize each one on a cycle. So we may have 10 unoptimized
segments after 30 days (as new ones have made the old documents obsolete), and
during the refetch the dups are eliminated -- this also eases maintenance and
merges.

The decision is up to you. We simply create them in buckets -- taking a
hash of the URL and ANDing it with (number of segments - 1), which is
effectively a mod by the number of segments when the count is a power of two.
You can come up with your own scheme, but the idea is to keep small segments
from the ground up, merge them as you grow, and then, once they grow too big,
eliminate them completely (as newer, smaller segments contain pretty much the
same data).

The above pretty much follows the map-reduce paradigm.

As a real-world example, we currently have numerous distributed servers
(numbering 2^n), each holding multiple segments of different sizes (the
segment counts also being powers of two). One such server contains 8 segments
of approx. 40GB each, 8 segments of 5GB or so, and about 32 segments under
50MB. These 32 small segments get merged into one segment every few hours.
Average search time on this server is about 160 ms.

By breaking this down smaller and smaller based on the hash value, we know
exactly which server, and which segment on that server, a document should be
in -- thus avoiding duplicates.
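
As an illustration of that two-level routing (again a hypothetical sketch, not
our production code), the same masking trick can be applied twice -- the low
bits of the URL hash pick the server, the next bits pick the segment on that
server:

public class DocumentRouter {

    private final int numServers;        // e.g. 16 (2^4), power of two
    private final int segmentsPerServer; // e.g. 8  (2^3), power of two

    public DocumentRouter(int numServers, int segmentsPerServer) {
        this.numServers = numServers;
        this.segmentsPerServer = segmentsPerServer;
    }

    // Which server is responsible for this URL.
    public int serverFor(String url) {
        int h = url.hashCode() & 0x7fffffff;
        return h & (numServers - 1);
    }

    // Which segment on that server the URL belongs in.
    public int segmentFor(String url) {
        int h = url.hashCode() & 0x7fffffff;
        // Skip the bits already used to pick the server so the two
        // choices stay independent.
        return (h >>> Integer.numberOfTrailingZeros(numServers)) & (segmentsPerServer - 1);
    }
}

Because the mapping is deterministic, a given URL can only ever live in one
(server, segment) pair, which is what prevents duplicates across the cluster.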


CC-



 

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 04, 2006 4:10 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

Thanks for the response.  Having a bunch of 50GB segments is more
manageable, but is it quick enough for user searches?

How do you decide what documents go into what segments?

Thanks again,
DaveG


-----Original Message-----
From: Chirag Chaman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 4:00 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce


While I like the overall idea, my *personal* feeling is that 200GB segments
are more of a problem if you want fast recovery after a failover.

We instead use multiple segments, each no more than 50GB or so, and do
fast recovery via a sync process and a check daemon. 

NOTE: At the time we built our solution, NDFS was not production quality --
not sure where things stand now.
 

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 04, 2006 3:40 PM
To: [email protected]
Subject: Scaling Nutch 0.8 via Map/Reduce

Hi, in working with Map/Reduce in Nutch 0.8, I'd like to distribute segments
to multiple machines via NDFS.  Let's say I've got ~250GB of hard-drive
space per machine; to store terabytes of data, should I generate a bunch of
~200GB segments and push them out into NDFS?

 

How do I partition/organize these segments?  Randomly?  By URL or host?
The relevant use case is to randomly access a given URL or host -- or is
this accomplished via map/reduce?

 

Thanks for any insight or ideas!

 

DaveG




