Hello, I am experiencing a similar problem with "db.ignore.external.links". Did you find a solution?

Best,
Oleg Mürk

Hilkiah Lavinier wrote:
>
> Hi, I need to better understand the impact of the db.ignore.external.links
> property.
>
> I have this set to true in my nutch-site.xml file. Based on the
> description, I expect that links to sites not included in the initial
> inject list won't get indexed. However, after running a -depth 10 from an
> initial list of 15 sites, nutch has indexed (confirmed from searching with
> tomcat) hundreds of sites that were NOT included in the initial seed list.
> How come? Is there some other option that I must set to say "only index
> the pages for the sites included in the initially supplied seed list"?
>
> For what it's worth, I'm using the urlfilter-suffix instead of the
> urlfilter-regex since I read somewhere that the regex filter causes
> crashes and the suffix one is more stable etc.
>
> Thanks,
>
> Hilkiah G. Lavinier MEng (Hons), ACGI
> 6 Winston Lane,
> Goodwill,
> Roseau, Dominica
> Mbl: (767) 275 3382
> Hm : (767) 440 3924
> Fax: (767) 440 4991
> VoIP USA: (646) 432 4487
>
> Email: [EMAIL PROTECTED]
> Email: [EMAIL PROTECTED]
> IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
> IM: ICQ #8978201 / AOL hilkiah21
>
> ----- Original Message ----
> From: Hilkiah Lavinier <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, January 19, 2008 8:35:18 PM
> Subject: Re: distributed search servers
>
> Thanks for the quick response.
>
> Dennis, I'm not sure how to change the setting in the NutchBean,
> however I set the variable int hitsPerSite in search.jsp instead.
>
> On a performance note, do you recommend loading the indexes directory
> in ram (tmpfs on linux) to reduce IO and increase performance? I guess
> it depends on how large the index is and how much ram is available,
> however it sounds like a too good to be true method of squeezing out
> extra performance from a nutch web server. Your thoughts pls.
>
> Regards,
>
> Hilkiah G. Lavinier MEng (Hons), ACGI
>
> ----- Original Message ----
> From: Dennis Kubes <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, January 19, 2008 7:24:03 PM
> Subject: Re: distributed search servers
>
> Hilkiah Lavinier wrote:
>> Hi all,
>>
>> Have a distributed search issue I need some advice on. The scenario
>> is that I have tomcat running off one server and two nutch search
>> servers running off two other machines (so 3 machines in total). I've
>> set up the nutch war to correctly call the search servers and they
>> respond. Problem is I get duplicate results. Now I have the same
>> data/information from the crawl copied on both machines so the crawl
>> data is replicated on both machines.
>>
>> Questions:
>> 1) how do I prevent the duplicate response? If I start a third search
>> server I only get two duplicate responses so it doesn't seem to
>> increase with the number of search servers
>
> In your query or in NutchBean set the hitsPerSite=1, here is an
> example:
>
> Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java
>
> No Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java&hitsPerSite=1
>
> This is based on hostname so for instance java.net and www.java.net
> will be considered different even though they are the same. The latter
> problem has not been corrected yet in Nutch, but we are working on it.
>
>> 2) does tomcat wait for ALL search servers to respond before
>> displaying the query result or does it display the result as soon as
>> one server responds?
>
> Yes, to a timeout value. If one goes down it will slow down the entire
> search cluster.
>
>> 3) in terms of load sharing, what is the best approach for
>> distributed search servers?
>
> If you are looking at a round-robin sort of load balancing I would say
> two nutch servers hitting different search servers with replicated
> content fronted by an apache server or hardware load balancer. Remember
> that the entire search can still be up even if one or more search
> servers fail. I would worry more about clustering the front end search
> website than load balancing the search servers but it all depends on
> what your goal is. For a www search we don't care if a few of the
> search servers are down as long as the search is functional.
>
> Dennis Kubes
>
>>
>> Any help would be greatly appreciated!
>>
>> Thanks,
>>
>> Hilkiah G. Lavinier MEng (Hons), ACGI
>

--
View this message in context: http://www.nabble.com/db.ignore.external.links-tp14982002p15518399.html
Sent from the Nutch - User mailing list archive at Nabble.com.

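
As an aside on the hitsPerSite advice quoted above: here is a minimal Python sketch of per-host result limiting, to show why java.net and www.java.net end up counted as different sites when the key is the exact hostname. This is not Nutch code. The function name limit_hits_per_site and the sample URLs are made up for illustration, and Nutch's own implementation may well differ.

```python
from urllib.parse import urlsplit


def limit_hits_per_site(urls, hits_per_site=1):
    """Keep at most hits_per_site results per exact hostname, preserving order.

    Hypothetical helper for illustration only; it is not part of Nutch.
    """
    seen = {}   # hostname -> number of results already kept
    kept = []
    for url in urls:
        # urlsplit lowercases the hostname; no "www." normalization is done,
        # so java.net and www.java.net are treated as two different sites.
        host = urlsplit(url).hostname or ""
        count = seen.get(host, 0)
        if count < hits_per_site:
            kept.append(url)
            seen[host] = count + 1
    return kept


if __name__ == "__main__":
    # Made-up result list: two hits on java.net, one on www.java.net.
    hits = [
        "http://java.net/a",
        "http://java.net/b",
        "http://www.java.net/a",
    ]
    for url in limit_hits_per_site(hits, hits_per_site=1):
        print(url)
```

With hits_per_site=1 the second java.net hit is dropped, but the www.java.net hit survives because its hostname string is different, which matches the behaviour Dennis describes.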