Hi Jay! Why not use the "Google approach" and buy a lot of cheap workstations/servers to distribute the search across? Compared to high-end servers, you can get away really cheap these days. Even if NDFS isn't fully up to par in 0.7-dev yet, you can still move your indices to separate computers and distribute them that way. Writing a small client/server for this purpose can be done in a matter of hours. Gathering as much data as you have on one server sounds like a bad idea to me, no matter how monstrous that server is.
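To show the kind of thing I mean, here is a rough sketch of such a client/server in Java, using plain sockets. Everything in it is made up for illustration: searchLocal() is a placeholder for however you query the index sitting on each box, the "score TAB url" wire format is invented, and node1/node2 are stand-ins for your hostnames. This is not Nutch's own distributed search code, just a sketch of how little glue is needed:

  // MiniSearchServer.java - serves the index that lives on this box.
  // Protocol: client sends one query line; server replies with one
  // "score <TAB> url" line per hit, then closes the connection.
  import java.io.*;
  import java.net.*;

  public class MiniSearchServer {
      public static void main(String[] args) throws IOException {
          ServerSocket server = new ServerSocket(Integer.parseInt(args[0]));
          while (true) {
              Socket s = server.accept();
              BufferedReader in =
                  new BufferedReader(new InputStreamReader(s.getInputStream()));
              PrintWriter out = new PrintWriter(s.getOutputStream(), true);
              String query = in.readLine();
              String[] hits = searchLocal(query);
              for (int i = 0; i < hits.length; i++)
                  out.println(hits[i]);
              s.close();
          }
      }

      // Placeholder: wire this to whatever searches the local index.
      static String[] searchLocal(String query) {
          return new String[] { "1.0\thttp://example.com/" };
      }
  }

  // MiniSearchClient.java - fans one query out to every index server,
  // collects the replies and prints the ten best-scoring hits.
  import java.io.*;
  import java.net.*;
  import java.util.*;

  public class MiniSearchClient {
      static final String[] SERVERS = { "node1:9999", "node2:9999" };

      public static void main(String[] args) throws IOException {
          List merged = new ArrayList();
          for (int i = 0; i < SERVERS.length; i++) {
              String[] hp = SERVERS[i].split(":");
              Socket s = new Socket(hp[0], Integer.parseInt(hp[1]));
              new PrintWriter(s.getOutputStream(), true).println(args[0]);
              BufferedReader in =
                  new BufferedReader(new InputStreamReader(s.getInputStream()));
              for (String line = in.readLine(); line != null; line = in.readLine())
                  merged.add(line);
              s.close();
          }
          // Highest score first; the score is the number before the tab.
          Collections.sort(merged, new Comparator() {
              public int compare(Object a, Object b) {
                  double d = Double.parseDouble(((String) b).split("\t")[0])
                           - Double.parseDouble(((String) a).split("\t")[0]);
                  return d > 0 ? 1 : d < 0 ? -1 : 0;
              }
          });
          for (int i = 0; i < Math.min(10, merged.size()); i++)
              System.out.println(merged.get(i));
      }
  }

Merging raw scores from different indexes like this is naive - document frequencies differ between the sub-indexes - but it gets you going, and it shows why the real distributed search code in 0.7 is worth getting to work.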
Regarding the HITS algorithm - check out the whole-web crawl example on the Nutch website, where you select the top scorers after you finish a segment (of arbitrary size) and continue crawling from those high-ranking sites. That way the most authoritative sites end up in your index first, which is what you want.

Good night,
Fredrik

On 8/2/05, Jay Pound <[EMAIL PROTECTED]> wrote:
> ....
> One last important question: if I merge my indexes, will searching be
> faster than if I don't? I currently have 20 directories of 1-1.7 million
> pages each. And if I split these indexes across multiple machines, will
> searching be faster? I couldn't get the nutch-server to work, but I'm
> using 0.6.
> ...
> Thank you
> -Jay Pound
> Fromped.com
> BTW Windows 2000 is not 100% stable with dual-core processors. Nutch is
> OK, but it can't do too many things at once or I get a kernel inpage
> error (guess it's time to migrate to Windows Server 2003 - damn).
> ----- Original Message -----
> From: "Doug Cutting" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Tuesday, August 02, 2005 1:53 PM
> Subject: Re: Memory usage
>
> > Try the following settings in your nutch-site.xml:
> >
> > <property>
> >   <name>io.map.index.skip</name>
> >   <value>7</value>
> > </property>
> >
> > <property>
> >   <name>indexer.termIndexInterval</name>
> >   <value>1024</value>
> > </property>
> >
> > The first causes data files to use considerably less memory.
> >
> > The second affects index creation, so it must be set before you create
> > the index you search. It's okay if your segment indexes were created
> > without it: just (re-)merge the indexes, and the merged index will pick
> > up the setting and use less memory when searching.
> >
> > Combining these two, I have searched a 40+M page index on a machine
> > using about 500MB of RAM. That said, search times with such a large
> > index are not good. At some point, as your collection grows, you will
> > want to merge multiple indexes containing different subsets of
> > segments, put each on a separate box, and search them with distributed
> > search.
> >
> > Doug
> >
> > Jay Pound wrote:
> > > I'm testing an index of 30 million pages; it requires 1.5GB of RAM
> > > to search using Tomcat 5. I plan on having an index with multiple
> > > billion pages, but if this is how it scales, then even with 16GB of
> > > RAM I won't be able to have an index larger than 320 million pages.
> > > How can I distribute the memory requirements across multiple
> > > machines? Or is there another servlet container (like Resin) that
> > > requires less memory to operate? Has anyone else run into this?
> > > Thanks,
> > > -Jay Pound
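P.S. A few concrete footnotes that may save you some digging. The "select the top scorers" trick is just a matter of letting the generator cap each fetchlist. From memory of the 0.7-era whole-web tutorial (double-check the exact flags against your version):

  bin/nutch generate db segments -topN 500000
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb db $s

Repeat, and each round fetches the 500,000 best-scoring unfetched pages, so the authoritative sites land in the index first.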
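On the distributed search Doug mentions (and the nutch-server trouble you had in 0.6): in 0.7 you run a search server on every box that holds an index - if memory serves, something like

  bin/nutch server 9999 /path/to/index

- and point the web app at a directory whose search-servers.txt file lists one "host port" pair per line (via the searcher.dir property). The front end then fans each query out and merges the results: the same glue as the sketch above, done properly. I'm quoting the command and file names from memory, so verify them against your 0.7 checkout.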
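Finally, some back-of-the-envelope numbers on Doug's indexer.termIndexInterval advice, in case the effect isn't obvious: Lucene keeps every N-th term of the index in RAM, and N defaults to 128. An index with, say, 400 million unique terms then holds roughly 3.1 million terms in memory; at N=1024 that drops to about 390,000 - an 8x saving - at the price of a slightly longer sequential scan on each term lookup. The term counts are illustrative, not measured.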
