Hi, I thought I had a very simple requirement: I just want to crawl a fixed set of 2.3M URLs. Following the tutorial, I injected the URLs into the crawldb, generated a fetch list, and started fetching. After 5 days I found it had fetched 3M pages and was still going. I stopped the process, and after looking through past posts in this group I realize I have lost 5 days of crawling.
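
For reference, what I ran was roughly the following (paths and the segment name are just examples, reconstructed from memory, so the exact arguments may be off):

  bin/nutch inject crawl/crawldb urls                 # urls/ holds my 2.3M seed list
  bin/nutch generate crawl/crawldb crawl/segments     # produce a fetch list
  bin/nutch fetch crawl/segments/20080101000000       # fetch the segment generate just created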
Why did it fetch more pages than were in the fetch list? Is it because I left the value of "db.max.outlinks.per.page" at 100? Also, in the crawl command I didn't specify the "depth" parameter. Can somebody please help me understand the process? If this has already been discussed, please point me to the appropriate post.

From this mailing list I gathered that I should generate small fetch lists and merge the fetched content afterwards. Since my URL set is fixed, I don't want Nutch to discover new URLs. My understanding is that "./bin/nutch updatedb" will discover new URLs, and the next time I run "./bin/nutch generate" it will add those discovered URLs to the fetch list. Given that I just want to crawl my fixed list of URLs, what is the best way to do that?

Thanks in advance,
-Som

PS: I'm using nutch-0.9, in case that matters.
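
PS2: To make the question concrete, this is the cycle I believe introduces the newly discovered URLs (my own sketch of the generate/fetch/updatedb loop; the segment path is again just an example):

  # updatedb merges the fetched segment's outlinks into the crawldb
  bin/nutch updatedb crawl/crawldb crawl/segments/20080101000000
  # the next generate then puts those newly discovered URLs into the fetch list
  bin/nutch generate crawl/crawldb crawl/segments

This is the step I would like to avoid (or configure) so that only my original 2.3M URLs ever get fetched.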
