Obviously not: it must be for « [urls] just ending in US extensions (.com, .edu, etc.) ». :))
Anyway, it all sounds very impressive! Good luck with your investigations and please keep us posted.

Regards,
Sébastien.

--- [EMAIL PROTECTED] wrote:

> Wow, a pile of questions. :)
> Is this for a web-wide search engine?
>
> Otis
>
> --- Jay Pound <[EMAIL PROTECTED]> wrote:
>
> > What's the bottleneck for the slow searching? I'm monitoring it and it's
> > doing about 57% CPU load when I'm searching. It takes about 50 seconds
> > to bring up the results page the first time; if I then search for the
> > same thing again it's much faster.
> >
> > Doug, can I trash my segments after they are indexed? I don't want to
> > have cached access to the pages; do the segments still need to be there?
> > My 30-million-page index/segment is using over 300 GB. I have the space,
> > but when I get to hundreds of millions of pages I will run out of room
> > on my RAID controllers for HD expansion, so I'm planning on moving to
> > Lustre if NDFS is not stable by then. I plan on having a
> > multi-billion-page index if the memory requirements for that can be
> > below 16 GB per search node.
> >
> > Right now I'm getting pretty crappy results from my 30 million pages. I
> > read the paper "Authoritative Sources in a Hyperlinked Environment"
> > because someone said that's how the Nutch algorithm worked, so I'm
> > assuming that as my index grows, the pages that deserve top placement
> > will receive top placement. But I don't know if I should re-fetch a new
> > set of segments with root URLs just ending in US extensions (.com,
> > .edu, etc.). I made a small set testing this theory (100,000 pages) and
> > its results were much better than my results from the 30-million-page
> > index. What's your thought on this? Am I right in thinking that the
> > pages with the most pages linking to them will show up first? So if I
> > index 500 million pages, my results should be on par with the rest of
> > the "big dogs"?
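The paper Jay mentions describes Kleinberg's HITS algorithm, in which pages with many incoming links from good "hub" pages accumulate high "authority" scores. The toy four-page link graph and iteration count below are invented for illustration, and Nutch's actual scoring code is not claimed to be exactly this; it is just a minimal sketch of the hub/authority iteration the paper defines.

```python
# Minimal sketch of the HITS hub/authority iteration.
# The link graph here is a made-up example, not real crawl data.
links = {          # page -> list of pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):
    # A page's authority is the summed hub weight of pages linking to it;
    # a page's hub weight is the summed authority of pages it links to.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    for scores in (auth, hub):   # normalize so values stay bounded
        norm = sum(v * v for v in scores.values()) ** 0.5
        for p in scores:
            scores[p] /= norm

best = max(auth, key=auth.get)   # "c", which collects the most in-links
```

This matches Jay's intuition: in a link-analysis ranking, the pages with the most (well-regarded) pages linking to them rise to the top, and the effect becomes more reliable as the crawl grows.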
> > One last important question: if I merge my indexes, will searching be
> > faster than if I don't merge them? I currently have 20 directories of
> > 1-1.7 million pages each. And if I split these indexes across multiple
> > machines, will the searching be faster? I couldn't get the nutch-server
> > to work, but I'm using 0.6.
> >
> > I have a very fast server. I didn't know if the searching would take
> > advantage of SMP; fetching will, and I can run multiple indexes at the
> > same time. My HD array does 200 MB/sec I/O. I have the new dual-core
> > Opteron 275 (Italy core) with 4 GB RAM, working my way to 16 GB when I
> > need it and a second processor when I need it, and 1.28 TB of HD space
> > for Nutch currently, with expansion up to 5.12 TB. I'm currently
> > running Windows 2000 on it as they haven't made a SUSE 9.3 driver yet
> > for my RAID cards (HighPoint 2220), so my scalability will be to
> > 960 MB/sec with all the drives in the system and 4x2.2 GHz processor
> > cores. Until I need to cluster, that's what I have to play with for
> > Nutch, in case you guys needed to know what hardware I'm running.
> >
> > Thank you,
> > -Jay Pound
> > Fromped.com
> >
> > BTW, Windows 2000 is not 100% stable with dual-core processors. Nutch
> > is OK, but it can't do too many things at once or I'll get a kernel
> > inpage error (guess it's time to migrate to 2003 .NET Server; damn).
> >
> > ----- Original Message -----
> > From: "Doug Cutting" <[EMAIL PROTECTED]>
> > To: <[email protected]>
> > Sent: Tuesday, August 02, 2005 1:53 PM
> > Subject: Re: Memory usage
> >
> > > Try the following settings in your nutch-site.xml:
> > >
> > > <property>
> > >   <name>io.map.index.skip</name>
> > >   <value>7</value>
> > > </property>
> > >
> > > <property>
> > >   <name>indexer.termIndexInterval</name>
> > >   <value>1024</value>
> > > </property>
> > >
> > > The first causes data files to use considerably less memory.
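For context on why the first property saves memory: a Hadoop/Nutch MapFile reader normally loads its whole position index into RAM, and `io.map.index.skip` tells it to skip that many index entries between each one it keeps, trading a little extra seeking for a smaller resident index. The entry count below is a made-up illustration of the arithmetic:

```python
# Illustrative arithmetic only; the entry count is an assumption.
# With io.map.index.skip = 7, a MapFile reader keeps only every 8th
# index entry in RAM (it skips 7 entries between each one it loads).
index_entries = 8_000_000        # hypothetical entries in a data file's index
skip = 7
resident = index_entries // (skip + 1)
print(resident)                  # 1000000 -> one eighth of the entries
```

The same shape of trade-off applies to the second property: raising `indexer.termIndexInterval` from Lucene's default of 128 to 1024 means only every 1024th term of the term dictionary is held in memory, roughly an 8x reduction, at a small cost in term-lookup time.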
> > > The second affects index creation, so it must be done before you
> > > create the index you search. It's okay if your segment indexes were
> > > created without this; you can just (re-)merge indexes and the merged
> > > index will get the setting and use less memory when searching.
> > >
> > > Combining these two, I have searched a 40+M page index on a machine
> > > using about 500 MB of RAM. That said, search times with such a large
> > > index are not good. At some point, as your collection grows, you
> > > will want to merge multiple indexes containing different subsets of
> > > segments, put each on a separate box, and search them with
> > > distributed search.
> > >
> > > Doug
> > >
> > > Jay Pound wrote:
> > > > I'm testing an index of 30 million pages; it requires 1.5 GB of
> > > > RAM to search using Tomcat 5. I plan on having an index with
> > > > multiple billion pages, but if this is to scale, then even with
> > > > 16 GB of RAM I won't be able to have an index larger than 320
> > > > million pages. How can I distribute the memory requirements
> > > > across multiple machines? Or is there another servlet program
> > > > (like Resin) that will require less memory to operate? Has anyone
> > > > else run into this?
> > > > Thanks,
> > > > -Jay Pound
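Jay's 320-million-page ceiling is a straight linear extrapolation from his observed numbers; a quick check of the arithmetic, under the thread's own assumption that search-time RAM grows linearly with page count (Doug's two settings would lower the per-page footprint considerably):

```python
# Back-of-envelope check of the scaling estimate in the thread,
# assuming search-time RAM grows linearly with page count.
ram_for_30m_pages = 1.5          # GB observed for the 30M-page index
pages_per_gb = 30_000_000 / ram_for_30m_pages   # 20 million pages per GB
max_pages_at_16gb = 16 * pages_per_gb
print(int(max_pages_at_16gb))    # 320000000 -> the 320-million figure
```

Doug's answer is that past this point you stop scaling up a single box and instead shard the index across machines with distributed search, so each node only holds a subset of the segments in RAM.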
> > _______________________________________________
> > Nutch-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/nutch-general
