Hilkiah Lavinier wrote:
Hi, I need to better understand the impact of the
db.ignore.external.links property.

Which version of Nutch is this?


I have this set to true in my nutch-site.xml file.  Based on the
description, I expect that links to sites not included in the initial
inject list won't get indexed. However, after running with -depth 10
from an initial list of 15 sites, Nutch has indexed (confirmed by
searching with Tomcat) hundreds of sites that were NOT included in
the initial seed list.  How come?  Is there some other option that I
must set to say "only index the pages for the sites included in the
initially supplied seed list"?

No, this property should have the effect you expected. If it doesn't work properly, then it's a bug that needs to be fixed. Please be aware that certain aspects of redirect handling have changed recently. AFAIK the option should work correctly with the current code in trunk. The new URLs outside the initial seed hosts may come from redirects to external hosts.
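
To make the expected behaviour concrete, here is a minimal Python sketch of the kind of host comparison the property implies. It is an illustration only, not Nutch's actual code: the helper names host_of and filter_outlinks and the example URLs are hypothetical, and it assumes "external" means "a different host than the page the link was found on".

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def filter_outlinks(from_url, outlinks):
    """Keep only outlinks whose host matches the host of the source page."""
    from_host = host_of(from_url)
    return [link for link in outlinks if host_of(link) == from_host]


# Hypothetical page and outlinks, for illustration only.
page = "http://example.org/index.html"
links = [
    "http://example.org/about.html",
    "http://EXAMPLE.org/contact.html",
    "http://other.example.com/",
]
kept = filter_outlinks(page, links)
```

A check like this only looks at outlinks found on a page. A redirect is not an outlink, so a URL reached through a redirect would not pass through such a filter unless the redirect handling applies the same test, which would be consistent with the explanation above.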



For what it's worth, I'm using urlfilter-suffix instead of
urlfilter-regex, since I read somewhere that the regex filter causes
crashes and the suffix one is more stable.

This shouldn't matter in this case.



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
