Hilkiah Lavinier wrote:
Hi, I need to better understand the impact of the db.ignore.external.links property.
Which version of Nutch is this?
I have this set to true in my nutch-site.xml file. Based on the description, I expect that links to sites not included in the initial inject list won't get indexed. However, after running a crawl with -depth 10 from an initial list of 15 sites, Nutch has indexed (confirmed by searching through Tomcat) hundreds of sites that were NOT included in the initial seed list. How come? Is there some other option that I must set to say "only index the pages for the sites included in the initially supplied seed list"?
No, this property should have the effect you expected; if it doesn't work properly, then it's a bug that needs to be fixed. Please be aware that certain aspects of redirect handling have been changed recently. AFAIK the option should work correctly with the current code in trunk. The new URLs outside the initial seed hosts may come from redirects to external hosts.
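To check whether the unexpected sites really lie outside the seed hosts, a small script along these lines may help. It is only a sketch, not Nutch code: the function names (host_of, is_external, hosts_outside_seeds) and the example URLs are made up for illustration, and it assumes that "external" means the target host differs from the host of the source page. The crawled URL list would have to be exported from your crawl by whatever means you have available.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased host of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def is_external(from_url, to_url):
    """True when a link points to a different host than its source page.

    Assumption: this mirrors the idea behind db.ignore.external.links,
    it is not the actual Nutch implementation.
    """
    return host_of(from_url) != host_of(to_url)


def hosts_outside_seeds(seed_urls, crawled_urls):
    """Hosts present in the crawl that do not belong to any seed URL."""
    seed_hosts = {host_of(u) for u in seed_urls}
    return sorted({host_of(u) for u in crawled_urls} - seed_hosts)


# Hypothetical data: two seeds and three crawled URLs.
seeds = ["http://example.org/", "http://example.com/start.html"]
crawled = [
    "http://example.org/about.html",
    "http://www.example.net/landing",  # e.g. reached through a redirect
    "http://example.com/news/",
]
print(hosts_outside_seeds(seeds, crawled))
```

If the hosts reported this way are all targets of redirects from seed pages, that would point to the redirect handling rather than to the outlink filtering.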
For what it's worth, I'm using urlfilter-suffix instead of urlfilter-regex, since I read somewhere that the regex filter causes crashes and the suffix one is more stable.
This shouldn't matter in this case.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
