Thanks for the response. Having a bunch of 50GB segments is more manageable, but is it quick enough for user searches?
How do you decide what documents go into what segments?

Thanks again,
DaveG

-----Original Message-----
From: Chirag Chaman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 4:00 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

While I like the overall idea, my *personal* feeling is that 200GB segments are more of a problem if you want faster recovery after a failover. We instead use multiple segments, each one no more than 50GB or so, and do fast recovery via a sync process and a check daemon.

NOTE: At the time we built our solution, NDFS was not production quality -- not sure where things stand now.

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 3:40 PM
To: [email protected]
Subject: Scaling Nutch 0.8 via Map/Reduce

Hi, in working with Map/Reduce in Nutch 0.8, I'd like to distribute segments to multiple machines via NDFS. Let's say I've got ~250GB of hard-drive space per machine; to store terabytes of data, should I generate a bunch of ~200GB segments and push them out into NDFS? How do I partition/organize these segments? Randomly? By URL or host? The relevant use case is to randomly access a given URL or host -- or is this accomplished via map/reduce?

Thanks for any insight or ideas!
DaveG
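
[Editor's illustration -- not from the thread and not part of the Nutch 0.8 API: a minimal sketch in Java of the "by host" partitioning option raised above, hashing each URL's host so that all pages from one host land in the same segment. The class and method names are hypothetical.]

    import java.net.MalformedURLException;
    import java.net.URL;

    public class HostPartitioner {

        private final int numSegments;

        public HostPartitioner(int numSegments) {
            this.numSegments = numSegments;
        }

        /** Returns the segment index (0..numSegments-1) for the given URL. */
        public int segmentFor(String url) {
            String host;
            try {
                host = new URL(url).getHost().toLowerCase();
            } catch (MalformedURLException e) {
                host = url;  // fall back to hashing the raw string
            }
            // Mask off the sign bit so the modulus is always non-negative.
            return (host.hashCode() & Integer.MAX_VALUE) % numSegments;
        }

        public static void main(String[] args) {
            // e.g. a few TB split into ~50GB segments as suggested above
            HostPartitioner p = new HostPartitioner(40);
            System.out.println(p.segmentFor("http://example.com/page1.html"));
            System.out.println(p.segmentFor("http://example.com/deep/page2.html")); // same segment
        }
    }

Keeping a host inside one segment makes "look up everything we have for this host" a single-segment operation; a purely random assignment spreads that lookup across every segment.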
