I can see it has only one segment (the segment created by ./bin/nutch generate). Is there any reason why it is fetching more pages than are in the fetchlist?
Thanks,
Somnath

On 5/1/07, qi wu <[EMAIL PROTECTED]> wrote:
How many segments were generated during your crawl? If you have more than one segment, then some newly parsed outlinks from the fetched pages might be appended to the crawldb. To prevent this, you can try updatedb with the "-noAdditions" option in Nutch 0.9.

----- Original Message -----
From: "Somnath Banerjee" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, April 30, 2007 11:12 PM
Subject: Crawling fixed set of urls (newbie question)

> Hi,
>
> I thought I had a very simple requirement. I just want to crawl a fixed
> set of 2.3M URLs. Following the tutorial, I injected the URLs into the
> crawldb, generated a fetchlist and started fetching. After 5 days I found
> it had fetched 3M pages and fetching was still going on. I stopped the
> process, and looking at past posts in this group I realized that I had
> lost 5 days of crawl.
>
> Why did it fetch more pages than are in the fetchlist? Is it because I
> left the value of "db.max.outlinks.per.page" at 100? Also, in the crawl
> command I didn't specify the "depth" parameter. Can somebody please help
> me understand the process? If this has already been discussed, please
> point me to the appropriate post.
>
> From this mailing list I gathered that I should generate small fetchlists
> and merge the fetched content. Since my URL set is fixed, I don't want
> Nutch to discover new URLs. My understanding is that "./bin/nutch
> updatedb" will discover new URLs, and the next time I run "./bin/nutch
> generate" it will add those discovered URLs to the fetchlist. Given that
> I just want to crawl my fixed list of URLs, what is the best way to do
> that?
>
> Thanks in advance,
> -Som
> PS: I'm using nutch-0.9 in case that is required
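
For reference, a minimal sketch of the bounded crawl cycle being suggested here, assuming a Nutch 0.9 layout with the crawldb under crawl/crawldb and the seed URLs under urls/ (all paths, the topN value, and the thread count are illustrative, not prescribed by the thread):

    # Inject the fixed set of URLs once.
    ./bin/nutch inject crawl/crawldb urls/

    # One round: generate a bounded fetchlist, fetch it, then update the
    # crawldb without adding newly discovered outlinks.
    ./bin/nutch generate crawl/crawldb crawl/segments -topN 100000
    segment=`ls -d crawl/segments/* | tail -1`
    ./bin/nutch fetch $segment -threads 10
    # -noAdditions is the flag mentioned above; check that your build supports it.
    ./bin/nutch updatedb crawl/crawldb $segment -noAdditions

Repeating the generate/fetch/updatedb round until generate produces no new fetchlist should keep the crawl limited to the injected URLs, since -noAdditions stops updatedb from appending newly parsed outlinks to the crawldb.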

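On the question of how many pages a single segment should have produced, a quick way to check is sketched below, assuming the same crawl/ layout; readdb and readseg are standard Nutch tools, but verify the exact options against your version's usage output:

    # Overall crawldb statistics: total URLs, fetched vs. unfetched counts.
    ./bin/nutch readdb crawl/crawldb -stats

    # List each segment with the number of URLs generated and fetched in it.
    ./bin/nutch readseg -list -dir crawl/segments

Comparing the per-segment generated and fetched counts against the crawldb totals shows whether a round fetched more than its fetchlist contained, or whether extra URLs entered the crawldb between rounds.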