db.ignore.external.links and urlfilters

Neera Sharma Fri, 20 Mar 2009 15:51:25 -0700

Hi All,

I want to restrict a crawl to the domain specified in an input URL. I set the
*db.ignore.external.links* property to true, but I found that links
redirected outside the input URL's domain still got crawled. However, when I
configured the regex-urlfilter.txt and crawl-urlfilter.txt files instead, I
was able to avoid these extra URLs and crawled more URLs from the seed
domain. I expected both approaches to give the same results. Is this a bug?
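
For reference, the property setting described above would typically look
roughly like the following. This is only a sketch: placing it in
conf/nutch-site.xml (the usual override file) is an assumption, and the
description text is illustrative rather than copied from any shipped file.

```xml
<!-- Sketch of the override, assumed to live in conf/nutch-site.xml -->
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <!-- Illustrative description, not the official wording -->
    <description>Ignore outlinks that lead to a different host than the
    page they were found on.</description>
  </property>
</configuration>
```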


Is there a way to fix this issue without setting urlfilters?

With the filter files, I need to edit them before crawling each domain and
also need to restart Nutch. Is there a way to change these filter values at
runtime?

Thanks,
Neera
