Wow, a pile of questions. :)
Is this for a web-wide search engine?

Otis

--- Jay Pound <[EMAIL PROTECTED]> wrote:

> What's the bottleneck for the slow searching? I'm monitoring it and it's
> at about 57% CPU load when I'm searching. It takes about 50 secs to bring
> up the results page the first time; then if I search for the same thing
> again it's much faster.
>
> Doug, can I trash my segments after they are indexed? I don't want to
> have cached access to the pages; do the segments still need to be there?
> My 30 mil page index/segment is using over 300 GB. I have the space, but
> when I get to the hundreds of millions of pages I will run out of room on
> my RAID controllers for HD expansion. I'm planning on moving to Lustre if
> NDFS is not stable by then. I plan on having a multi-billion page index
> if the memory requirements for that can be below 16 GB per search node.
>
> Right now I'm getting pretty crappy results from my 30 million pages. I
> read the whitepaper on "Authoritative Sources in a Hyperlinked
> Environment" because someone said that's how the Nutch algorithm worked,
> so I'm assuming as my index grows the pages that deserve top placement
> will receive top placement, but I don't know if I should re-fetch a new
> set of segments with root URLs just ending in US extensions (.com, .edu,
> etc.). I made a small set testing this theory (100,000 pages) and its
> results were much better than my results from the 30 mil page index.
> What's your thought on this? Am I right in thinking that the pages with
> the most pages linking to them will show up first? So if I index 500
> million pages my results should be on par with the rest of the "big
> dogs"?
>
> One last important question: if I merge my indexes, will it search faster
> than if I don't merge them? I currently have 20 directories of 1-1.7 mil
> pages each. And if I split up these indexes across multiple machines,
> will the searching be faster? I couldn't get the nutch-server to work,
> but I'm using 0.6.
>
> I have a very fast server; I didn't know if the searching would take
> advantage of SMP. Fetching will, and I can run multiple indexes at the
> same time. My HD array is 200 MB a sec I/O. I have the new dual-core
> Opteron 275 (Italy core) with 4 GB RAM, working my way to 16 GB when I
> need it and a second processor when I need it, and 1.28 TB of HD space
> for Nutch currently, with expansion up to 5.12 TB. I'm currently running
> Windows 2000 on it as they haven't made a driver yet for SUSE 9.3 for my
> RAID cards (HighPoint 2220), so my scalability will be to 960 MB a sec
> with all the drives in the system and 4x2.2 GHz processor cores. Until I
> need to cluster, that's what I have to play with for Nutch.
>
> In case you guys needed to know what hardware I'm running.
> Thank you
> -Jay Pound
> Fromped.com
> BTW Windows 2000 is not 100% stable with dual-core processors. Nutch is
> OK but can't do too many things at once or I'll get a kernel inpage error
> (guess it's time to migrate to 2003.net server, damn)
>
> ----- Original Message -----
> From: "Doug Cutting" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Tuesday, August 02, 2005 1:53 PM
> Subject: Re: Memory usage
>
> > Try the following settings in your nutch-site.xml:
> >
> > <property>
> >   <name>io.map.index.skip</name>
> >   <value>7</value>
> > </property>
> >
> > <property>
> >   <name>indexer.termIndexInterval</name>
> >   <value>1024</value>
> > </property>
> >
> > The first causes data files to use considerably less memory.
> >
> > The second affects index creation, so must be done before you create
> > the index you search. It's okay if your segment indexes were created
> > without this, you can just (re-)merge indexes and the merged index
> > will get the setting and use less memory when searching.
> >
> > Combining these two I have searched a 40+M page index on a machine
> > using about 500MB of RAM. That said, search times with such a large
> > index are not good. At some point, as your collection grows, you will
> > want to merge multiple indexes containing different subsets of
> > segments and put each on a separate box and search them with
> > distributed search.
> >
> > Doug
> >
> > Jay Pound wrote:
> > > I'm testing an index of 30 million pages. It requires 1.5 GB of RAM
> > > to search using Tomcat 5. I plan on having an index with multiple
> > > billion pages, but if this is to scale then even with 16 GB of RAM I
> > > won't be able to have an index larger than 320 million pages? How
> > > can I distribute the memory requirements across multiple machines,
> > > or is there another servlet program (like Resin) that will require
> > > less memory to operate? Has anyone else run into this?
> > > Thanks,
> > > -Jay Pound
>
> _______________________________________________
> Nutch-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-general
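
For what it's worth, the memory figures quoted in this thread can be turned into a rough back-of-the-envelope estimate. The sketch below assumes memory use grows linearly with page count, which is the same assumption behind the 320 million figure above and may well not hold in practice. The function names and the 2 billion page target are made up for illustration; only the 30M pages / 1.5 GB, 40M pages / 500 MB, and 16 GB per node numbers come from the messages.

```python
import math


def max_pages(ram_gb, sample_pages, sample_ram_gb):
    """Pages that fit in ram_gb, scaling linearly from one measured data point."""
    return round(ram_gb * sample_pages / sample_ram_gb)


def nodes_needed(total_pages, pages_per_node):
    """Search nodes required if the index is split evenly across machines."""
    return math.ceil(total_pages / pages_per_node)


NODE_RAM_GB = 16  # per-node memory budget mentioned in the thread

# Jay's figure: 30M pages needed about 1.5 GB with default settings.
default_cap = max_pages(NODE_RAM_GB, 30_000_000, 1.5)

# Doug's figure: a 40M page index searched in about 0.5 GB with both settings.
tuned_cap = max_pages(NODE_RAM_GB, 40_000_000, 0.5)

# Hypothetical target, not from the thread: a 2 billion page index.
TARGET_PAGES = 2_000_000_000

print(default_cap, nodes_needed(TARGET_PAGES, default_cap))  # 320000000 7
print(tuned_cap, nodes_needed(TARGET_PAGES, tuned_cap))      # 1280000000 2
```

This only bounds memory. As Doug notes, search times on a single 40M page index are already not good, so the number of boxes needed for acceptable latency is likely higher than what RAM alone suggests.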
