I use this script to crawl and recrawl the web:
http://wiki.apache.org/nutch/Crawl

I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30)
because Nutch fetches the same URLs several times in different recrawl loops.
What should I do to force Nutch to fetch ONLY unfetched URLs from the crawldb?
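
To make the symptom concrete, here is a small sketch of how I think the URL
selection might behave. This is only my guess, not Nutch code: is_due, the
30-day refetch interval and the dates are all hypothetical names and values
I made up for illustration. The assumption is that adddays shifts "now"
forward when deciding which URLs are due, so adddays=30 could make pages
fetched yesterday look due again.

```python
from datetime import datetime, timedelta


def is_due(next_fetch_time, now, add_days=0):
    """Hypothetical model (not Nutch code): a URL is selected when its
    next fetch time is not later than 'now' shifted forward by add_days."""
    return next_fetch_time <= now + timedelta(days=add_days)


now = datetime(2009, 1, 1)           # arbitrary date, for illustration only
fetch_interval = timedelta(days=30)  # assumed refetch interval

# A page fetched yesterday: its next fetch time is 29 days from now.
fetched_yesterday = now - timedelta(days=1) + fetch_interval
# Assumption: a URL that was never fetched is due immediately.
never_fetched = now

for add_days in (0, 30):
    print(
        "adddays=%d fetched_yesterday_due=%s never_fetched_due=%s"
        % (
            add_days,
            is_due(fetched_yesterday, now, add_days),
            is_due(never_fetched, now, add_days),
        )
    )
```

If this model is roughly right, only the never-fetched URL is selected with
adddays=0, while with adddays=30 both are selected, which would match the
repeated fetches I see.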
-- 
View this message in context: 
http://www.nabble.com/nutch-fetches-already-fetched-urls-again-and-again-tp22226407p22226407.html
Sent from the Nutch - User mailing list archive at Nabble.com.
