I use this script to crawl and recrawl the web: http://wiki.apache.org/nutch/Crawl
I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30) because it fetches the same URLs several times in different recrawl loops. What should I do to force Nutch to fetch ONLY unfetched URLs from the crawldb?
