NutchDeveloper wrote:
I use this script to crawl and recrawl the web:
http://wiki.apache.org/nutch/Crawl

I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30)
because it fetches the same URLs several times in different recrawl loops.
What should I do to force Nutch to fetch ONLY unfetched URLs from the crawldb?

You can fetch the unfetched URLs, plus those that have expired, if you don't use the -topN switch.

-topN selects only the URLs with the highest scores.

Bartosz
