Thanks so much for the details! That makes a lot of sense. How many documents are you indexing? I'm working with a ~200-300 million document index, which translates to ~2-3 terabytes of Nutch 0.8 segments. With 250GB of space per Linux box, I'm thinking ~8 distributed machines would be sufficient -- maybe 16 if the webdb and future scaling are considered. Or just 8 machines with more disk space (increased from 250GB to 500GB).
From your experience, does this sound reasonable? :-)  Thanks again!

DaveG

-----Original Message-----
From: Chirag Chaman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 9:10 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

I doubt you will see the difference with four 50GB segments (maybe an extra few tens of ms). On the contrary, we feel that 50GB segments are faster, as we can merge/optimize each one on a cycle. So we may have 10 unoptimized segments after 30 days (as new ones have made the old documents obsolete), and during the refetch the dups are eliminated -- this also eases the maintenance & merges.

The decision is up to you. We simply create them in buckets -- taking a hash value of the URL and ANDing it with (number of segments - 1), which is basically a mod by the number of segments. You can come up with your own way, but the idea is, from the ground up, to keep small segments and merge them as you grow; then, when they grow too big, completely eliminate them (as newer, smaller segments contain pretty much the same data). The above pretty much follows the map-reduce paradigm.

As a real-world example, we currently have numerous distributed servers (numbering 2^n) -- each having multiple segments of different sizes (also conforming to 2^n). So, one such server contains 8 segments of approx. 40GB each, 8 segments of 5GB or so, and about 32 segments under 50MB. These 32 segments get merged every few hours into one segment. Avg. search time on this server is about 160 ms.

By breaking this down smaller and smaller based on the hash value, we know exactly on which server and within which segment a document should be -- thus avoiding duplicates.

CC-

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 4:10 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

Thanks for the response. Having a bunch of 50GB segments is more manageable, but is it quick enough for user searches? How do you decide which documents go into which segments?

Thanks again,
DaveG

-----Original Message-----
From: Chirag Chaman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 4:00 PM
To: [email protected]
Subject: RE: Scaling Nutch 0.8 via Map/Reduce

While I like the overall idea, my *personal* feeling is that 200GB segments are more of a problem if you want faster recovery after a failover. We instead use multiple segments, each one no more than 50GB or so, and do fast recovery via a sync process and a check daemon.

NOTE: At the time we built our solution, NDFS was not production quality -- not sure where things stand now.

-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 3:40 PM
To: [email protected]
Subject: Scaling Nutch 0.8 via Map/Reduce

Hi, in working with Map/Reduce in Nutch 0.8, I'd like to distribute segments to multiple machines via NDFS. Let's say I've got ~250GB of hard-drive space per machine; to store terabytes of data, should I generate a bunch of ~200GB segments and push them out into NDFS? How do I partition/organize these segments? Randomly? By URL or host? The relevant use case is to randomly access a given URL or host -- or is this accomplished via map/reduce?

Thanks for any insight or ideas!

DaveG
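
For anyone who wants to try the bucket assignment Chirag describes above, here is a minimal Java sketch. It assumes the segment count is a power of two (so that ANDing with count - 1 is the same as a mod), and the class/method names are made up for illustration -- this is not Nutch API code:

  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;

  /** Assigns URLs to segment buckets via hash & (numSegments - 1). */
  public class SegmentBucketer {

      private final int numSegments; // must be a power of two for AND to equal mod

      public SegmentBucketer(int numSegments) {
          if (Integer.bitCount(numSegments) != 1) {
              throw new IllegalArgumentException("numSegments must be a power of two");
          }
          this.numSegments = numSegments;
      }

      /** Returns the bucket (segment index) for a given URL. */
      public int bucketFor(String url) {
          int h = hash(url);
          // h & (n - 1) == h mod n when n is a power of two
          return h & (numSegments - 1);
      }

      /** A stable hash of the URL; MD5 is just one possible choice. */
      private static int hash(String url) {
          try {
              byte[] d = MessageDigest.getInstance("MD5")
                      .digest(url.getBytes(StandardCharsets.UTF_8));
              // fold the first four digest bytes into a non-negative int
              return ((d[0] & 0xFF) << 24 | (d[1] & 0xFF) << 16
                      | (d[2] & 0xFF) << 8 | (d[3] & 0xFF)) & 0x7FFFFFFF;
          } catch (NoSuchAlgorithmException e) {
              throw new RuntimeException(e); // MD5 is always available in the JDK
          }
      }

      public static void main(String[] args) {
          SegmentBucketer b = new SegmentBucketer(16);
          System.out.println(b.bucketFor("http://lucene.apache.org/nutch/"));
      }
  }

With 16 buckets, every URL lands deterministically in one of segments 0-15, so a refetched page always maps back to the same segment as its older copy -- which is what makes the dedup-on-refetch and merge scheme described above workable.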
