Hi Neera,

try fetcher2 instead of fetcher. In my experience, the fetcher2 implementation 
considers the "db.ignore.external.links" setting even for redirects, so you 
don't need URL filters to limit a crawl to a certain domain.
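
For reference, here is a minimal sketch of the relevant nutch-site.xml 
entry; the property name is the one from your mail, the description text 
is mine:

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
      <description>If true, outlinks (and, with fetcher2, redirect
      targets) pointing to external hosts are ignored.</description>
    </property>

Depending on your Nutch version, bin/nutch may not have a dedicated 
fetch2 command, but the script also accepts a fully qualified class name. 
The segment path below is just an example:

    bin/nutch org.apache.nutch.fetcher.Fetcher2 crawl/segments/20090320235100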

Kind regards,
Martina


-----Original Message-----
From: Neera Sharma [mailto:neera.sha...@gmail.com] 
Sent: Friday, March 20, 2009 23:51
To: nutch-user@lucene.apache.org
Subject: db.ignore.external.links and urlfilters

Hi All,

I want to restrict a crawl to the domain of a given input URL. I set the
db.ignore.external.links property to true, but I found that links that
redirect outside the input domain still got crawled. However, when I
configured the regex-urlfilter.txt and crawl-urlfilter.txt files, I was
able to avoid these extra URLs and crawled more URLs from the seed domain.
I expected both approaches to give the same results. Is this a bug?
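
For reference, my filter entries look roughly like this (example.com 
stands in for my actual seed domain):

    # accept URLs within the seed domain
    +^http://([a-z0-9-]+\.)*example\.com/
    # reject everything else
    -.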

Is there a way to fix this issue without setting urlfilters?

With the filter-file approach, I need to edit the files before crawling
each domain and also need to restart Nutch. Is there a way to change
these filter values at runtime?

Thanks,
Neera
