Thanks for the explanation. One further question: if I merge the indexes directory (the indices) into a single directory called index, should I load the index directory or the indexes directory into RAM? As a follow-up, would nutch/tomcat use the index directory in preference to the indexes directory if both are present?
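On the RAM side, here is a minimal sketch (plain Python, not anything shipped with Nutch) of checking that an index directory fits into a tmpfs mount before copying it there. The function names, the headroom fraction, and the paths in the usage note are assumptions for illustration only; which directory to pass in is exactly the open question above.

```python
import os
import shutil


def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def copy_if_fits(src, mount, headroom=0.2):
    """Copy src under mount only if it fits with some free space left over.

    headroom is the fraction of the mount's free space to keep unused
    (an arbitrary value chosen for this sketch).
    Returns the destination path, or None when the copy was skipped.
    """
    needed = dir_size(src)
    free = shutil.disk_usage(mount).free
    if needed > free * (1.0 - headroom):
        return None
    # copytree raises an error if the destination already exists.
    dest = os.path.join(mount, os.path.basename(os.path.normpath(src)))
    shutil.copytree(src, dest)
    return dest
```

With hypothetical paths, `copy_if_fits("/path/to/crawl/indexes", "/mnt/ramindex")` would copy the directory only when the mount has room for it, and return None otherwise. The search server would still have to be pointed at the copy.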
Secondly, is there any difference between the nightly builds and the svn version? I was able to build the svn version (using ant) but could NOT build the nightly build (#334), which according to Hudson is the last successful build. The build failed with this error:

    Buildfile: build.xml

    init:

    BUILD FAILED
    /home/hilkiah/nutch-2008-01-20_10-49-31/build.xml:61: Specify at least one source--a file or resource collection.

    Total time: 1 second
    [EMAIL PROTECTED]:/home/hilkiah/nutch-2008-01-20_10-49-31# ant package
    Buildfile: build.xml

    init:

    BUILD FAILED
    /home/hilkiah/nutch-2008-01-20_10-49-31/build.xml:61: Specify at least one source--a file or resource collection.

    Total time: 1 second
    [EMAIL PROTECTED]:/home/hilkiah/nutch-2008-01-20_10-49-31#

Regards,

Hilkiah G. Lavinier MEng (Hons), ACGI
6 Winston Lane,
Goodwill,
Roseau, Dominica
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487

Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201 / AOL hilkiah21

----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, January 20, 2008 9:59:24 AM
Subject: Re: distributed search servers

Here is a link to a previous posting on the hadoop list about how we go about our setup:

http://www.mail-archive.com/[email protected]/msg10088.html

Long story short, create a tmpfs (which is a RAM file system) and stick only the indexes part (not contents or linkdb) into memory. This will increase performance 10x if not more. I don't see much performance improvement from putting the nutch site into memory (although I guess you could), as servlets (JSP) are already in memory. Currently we are testing 5M page indexes on 8G 1U boxes using a PAE kernel.

Dennis Kubes

Hilkiah Lavinier wrote:
> Thanks for the quick response.
>
> Dennis, I'm not sure how to change the setting in the NutchBean, however I set the variable int hitsPerSite in search.jsp instead.
>
> On a performance note, do you recommend loading the indexes directory in ram (tmpfs on linux) to reduce IO and increase performance? I guess it depends on how large the index is and how much ram is available, however it sounds like a too good to be true method of squeezing out extra performance from a nutch web server. Your thoughts pls.
>
> Regards,
>
> Hilkiah G. Lavinier MEng (Hons), ACGI
>
> ----- Original Message ----
> From: Dennis Kubes <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, January 19, 2008 7:24:03 PM
> Subject: Re: distributed search servers
>
> Hilkiah Lavinier wrote:
>> Hi all,
>>
>> Have a distributed search issue I need some advice on. The scenario is that I have tomcat running off one server and two nutch search servers running off two other machines (so 3 machines in total). I've setup the nutch war to correctly call the search servers and they respond. Problem is I get duplicate results. Now I have the same data/information from the crawl copied on both machines so the crawl data is replicated on both machines.
>>
>> Questions:
>> 1) how do I prevent the duplicate response? If I start a third search server I only get two duplicate responses so it doesn't seem to increase with the number of search servers
>
> In your query or in NutchBean set the hitsPerSite=1, here is an example:
>
> Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java
>
> No Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java&hitsPerSite=1
>
> This is based on hostname so for instance java.net and www.java.net will be considered different even though they are the same.
> The latter problem has not been corrected yet in Nutch, but we are working on it.
>
>> 2) does tomcat wait for ALL search servers to respond before displaying the query result or does it display the result as soon as one server responds?
>
> Yes, to a timeout value. If one goes down it will slow down the entire search cluster.
>
>> 3) in terms of load sharing, what is the best approach for distributed search servers?
>
> If you are looking at a round-robin sort of load balancing I would say two nutch servers hitting different search servers with replicated content fronted by an apache server or hardware load balancer. Remember that the entire search can still be up even if one or more search servers fail. I would worry more about clustering the front end search website than load balancing the search servers but it all depends on what your goal is. For a www search we don't care if a few of the search servers are down as long as the search is functional.
>
> Dennis Kubes
>
>> Any help would be greatly appreciated!
>>
>> Thanks,
>>
>> Hilkiah G. Lavinier MEng (Hons), ACGI
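To illustrate the hostname point in the quoted reply, here is a small sketch of capping hits per exact hostname. It is not Nutch's actual implementation, only an assumed model of the behaviour described in the thread; the function name and the example URLs are made up, and treating 0 as "no limit" is a choice made for this sketch.

```python
from urllib.parse import urlsplit


def limit_hits_per_site(urls, hits_per_site=1):
    """Keep at most hits_per_site URLs per exact hostname, preserving order."""
    if hits_per_site <= 0:
        return list(urls)  # sketch-only convention: 0 means no limit
    seen = {}
    kept = []
    for url in urls:
        # Hosts are compared as plain strings, so "java.net" and
        # "www.java.net" count as two different sites.
        host = urlsplit(url).hostname or ""
        count = seen.get(host, 0)
        if count < hits_per_site:
            kept.append(url)
            seen[host] = count + 1
    return kept
```

Given hits from `http://java.net/a`, `http://java.net/b` and `http://www.java.net/a`, a limit of 1 keeps the first and third, which is why the same page can still show up twice when it is reachable under both hostnames.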
