Re: nutch fetches already fetched urls again and again

Bartosz Gadzimski Thu, 26 Feb 2009 08:28:00 -0800

NutchDeveloper wrote:

I use this script to crawl and recrawl the web:
http://wiki.apache.org/nutch/Crawl


I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30)
because it fetches the same URLs several times in different recrawl loops.
What should I do to force Nutch to fetch ONLY unfetched URLs from the crawldb?

If you don't use the -topN switch, Nutch fetches all unfetched URLs plus those
whose fetch interval has expired.

With -topN, only the highest-scoring URLs are selected.
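
To make the difference concrete, here is a minimal sketch of the selection behaviour described above. It is a simplified model, not Nutch's actual generator code: the Entry class, its fields, select_for_fetch and the example URLs are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for one crawldb record (not a Nutch class)."""
    url: str
    fetched: bool
    score: float
    days_until_due: float  # <= 0 means the fetch interval has expired


def select_for_fetch(entries: List[Entry], top_n: Optional[int] = None) -> List[Entry]:
    """Pick unfetched or expired entries, optionally capped to the top_n scores."""
    # A URL is a candidate if it was never fetched or its interval has expired.
    due = [e for e in entries if not e.fetched or e.days_until_due <= 0]
    # Highest score first, so a cap keeps only the best-scoring candidates.
    due.sort(key=lambda e: e.score, reverse=True)
    return due if top_n is None else due[:top_n]


crawldb = [
    Entry("http://example.com/a", fetched=False, score=0.2, days_until_due=0),
    Entry("http://example.com/b", fetched=True, score=0.9, days_until_due=-1),
    Entry("http://example.com/c", fetched=True, score=0.5, days_until_due=10),
    Entry("http://example.com/d", fetched=False, score=0.7, days_until_due=0),
]

# No cap: every unfetched or expired URL is selected.
print([e.url for e in select_for_fetch(crawldb)])
# With a cap: only the highest-scoring candidates survive.
print([e.url for e in select_for_fetch(crawldb, top_n=2)])
```

In this toy data the capped run keeps the expired, already fetched URL b and drops the unfetched URL a, because a has the lowest score. That is the kind of effect a score-based cap can have on a recrawl loop.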

Bartosz
