Thanks for the response.  Having a bunch of 50GB segments is more
manageable, but is it quick enough for user searches?

How do you decide what documents go into what segments?

Thanks again,
DaveG


-----Original Message-----
From: Chirag Chaman [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 04, 2006 4:00 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce


While I like the overall idea, my *personal* feeling is that 200GB
segments
are more of a problem if you want faster recovery after a failover.

We instead use multiple segments each one no more that 50GB or so, and
do
fast recovery via an sync process and a check daemon. 

NOTE: At the time we built our solution NDFS was not production quality
--
not sure where things stand now.
 

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 04, 2006 3:40 PM
To: [email protected]
Subject: Scaling Nutch 0.8 via Map/Reduce

Hi, in working with Map/Reduce in Nutch 0.8, I'd like to distribute
segments
to multiple machines via NDFS.  Let's say I've got ~250GB of hard-drive
space per machine; to store terabytes of data, should I generate a bunch
of
~200GB segments and push them out into NDFS?

 

How do I partition/organize these segments?  Randomly?  By URL or host?
The relevant use case is to randomly access a given URL or host---or is
this
accomplished via map/reduce?

 

Thanks for any insight or ideas!

 

DaveG



Reply via email to