Hello,

I am experiencing a similar problem with "db.ignore.external.links". Did you
find a solution?

Best,
Oleg Mürk


Hilkiah Lavinier wrote:
> 
> Hi, I need to better understand the impact of the db.ignore.external.links
> property.
> 
> I have this set to true in my nutch-site.xml file.  Based on the
> description, I expected that links to sites not included in the initial
> inject list would not get indexed.  However, after running a crawl with
> -depth 10 from an initial list of 15 sites, Nutch has indexed (confirmed
> by searching through Tomcat) hundreds of sites that were NOT included in
> the initial seed list.  How come?  Is there some other option that I must
> set to say "only index the pages for the sites included in the initially
> supplied seed list"?
> 
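For reference, as I read the property description, with
db.ignore.external.links set to true the outlinks leading from a page to
external hosts are ignored. Below is a minimal Python sketch of that kind of
host comparison. It is not Nutch's code: the function names are made up for
illustration, and it assumes the check compares each outlink's host with the
host of the page the link was found on.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def internal_outlinks(page_url, outlinks):
    """Keep only outlinks whose host equals the host of the source page.

    Hypothetical helper: an exact hostname match is assumed, so
    example.org and www.example.org count as different hosts.
    """
    page_host = host_of(page_url)
    return [link for link in outlinks if host_of(link) == page_host]
```

Under that assumption, a link from http://example.org/ to
http://other.example.com/ would be dropped, while links that stay on
example.org would be kept.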
> For what it's worth, I'm using urlfilter-suffix instead of
> urlfilter-regex, since I read somewhere that the regex filter causes
> crashes and the suffix one is more stable.
> 
> Thanks,
> 
> Hilkiah G. Lavinier MEng (Hons), ACGI 
> 6 Winston Lane, 
> Goodwill, 
> Roseau, Dominica 
> Mbl: (767) 275 3382
> Hm : (767) 440 3924
> Fax: (767) 440 4991
> VoIP USA: (646) 432 4487
>  
> Email: [EMAIL PROTECTED]
> Email: [EMAIL PROTECTED]
> IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
> IM: ICQ #8978201  / AOL hilkiah21
> 
> ----- Original Message ----
> From: Hilkiah Lavinier <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, January 19, 2008 8:35:18 PM
> Subject: Re: distributed search servers
> 
> 
> Thanks for the quick response.
> 
> Dennis, I'm not sure how to change the setting in NutchBean, so I set
> the int variable hitsPerSite in search.jsp instead.
> 
> On a performance note, do you recommend loading the indexes directory
> into RAM (tmpfs on Linux) to reduce I/O and increase performance?  I guess
> it depends on how large the index is and how much RAM is available, but
> it sounds like a too-good-to-be-true way of squeezing extra performance
> out of a Nutch web server.  Your thoughts, please.
> 
> 
> Regards,
>  
> Hilkiah G. Lavinier MEng (Hons), ACGI 
> 
> ----- Original Message ----
> From: Dennis Kubes <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, January 19, 2008 7:24:03 PM
> Subject: Re: distributed search servers
> 
> 
> 
> 
> Hilkiah Lavinier wrote:
>> Hi all,
>> 
>> I have a distributed search issue I need some advice on.  The scenario
>> is that I have Tomcat running on one server and two Nutch search
>> servers running on two other machines (so 3 machines in total).  I've
>> set up the Nutch war to correctly call the search servers and they
>> respond.  The problem is that I get duplicate results.  The same crawl
>> data is copied to both machines, so it is replicated on both.
>> 
>> Questions:
>> 1) How do I prevent the duplicate responses? If I start a third search
>> server I still get only two duplicate responses, so the number doesn't
>> seem to increase with the number of search servers.
> 
> In your query or in NutchBean, set hitsPerSite=1.  Here is an example:
> 
> Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java
> 
> No Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java&hitsPerSite=1
> 
> This is based on hostname, so for instance java.net and www.java.net
> will be considered different even though they are the same site.  The
> latter problem has not been corrected yet in Nutch, but we are working
> on it.
> 
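The hostname-based grouping described above can be sketched in a few lines
of Python. This is only an illustration of the idea, not the NutchBean
implementation; the function name and its default are hypothetical, and
hosts are compared as exact strings, which is why java.net and www.java.net
end up in separate groups.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def limit_hits_per_site(urls, hits_per_site=1):
    """Keep at most hits_per_site results per exact hostname, in order."""
    seen = defaultdict(int)  # hostname -> number of hits kept so far
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if seen[host] < hits_per_site:
            seen[host] += 1
            kept.append(url)
    return kept
```

With hits_per_site=1, a result list containing two pages from java.net and
one from www.java.net keeps the first java.net page and the www.java.net
page.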
>> 2) Does Tomcat wait for ALL search servers to respond before
>> displaying the query result, or does it display the result as soon as
>> one server responds?
> 
> Yes, up to a timeout value.  If one goes down it will slow down the entire
> search cluster.
> 
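The "wait for all servers, up to a timeout" behaviour can be sketched in
Python as follows. This is a generic fan-out pattern under stated
assumptions, not Nutch's distributed search code: each search server is
modelled as a plain callable, and fan_out and timeout_s are made-up names.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fan_out(query, servers, timeout_s):
    """Send query to every server; merge hits from those done in time.

    servers is a list of callables (hypothetical stand-ins for remote
    search servers), each taking the query and returning a list of hits.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(servers)))
    futures = [pool.submit(server, query) for server in servers]
    done, _not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on servers that are still busy
    hits = []
    for future in futures:  # iterate in server order for a stable result
        if future in done and future.exception() is None:
            hits.extend(future.result())
    return hits
```

In this sketch a server that never answers costs every query the full
timeout, which matches the remark that one failed server slows down the
whole cluster.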
>> 3) In terms of load sharing, what is the best approach for
>> distributed search servers?
> 
> If you are looking at a round-robin sort of load balancing, I would say
> two Nutch servers hitting different search servers with replicated
> content, fronted by an Apache server or a hardware load balancer.
> Remember that the entire search can still be up even if one or more
> search servers fail.  I would worry more about clustering the front-end
> search website than load balancing the search servers, but it all
> depends on what your goal is.  For a WWW search we don't care if a few
> of the search servers are down as long as the search is functional.
> 
> Dennis Kubes
> 
> 
>> 
>> Any help would be greatly appreciated!
>> 
>> Thanks,
>> 
>> Hilkiah G. Lavinier MEng (Hons), ACGI 

-- 
View this message in context: 
http://www.nabble.com/db.ignore.external.links-tp14982002p15518399.html
Sent from the Nutch - User mailing list archive at Nabble.com.