How many segments were generated during your crawl? If you have more than one segment, then newly parsed outlinks from the fetched pages may have been appended to the crawldb. To prevent this, you can try running updatedb with the "-noAdditions" option in Nutch 0.9.1.
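For example, each round would then look roughly like this (the crawldb/segments paths and <segment> below are just placeholders for your own layout; check the usage message of "bin/nutch updatedb" in your version to confirm the option is supported):

  bin/nutch generate crawl/crawldb crawl/segments              # build a fetch list
  bin/nutch fetch crawl/segments/<segment>                     # fetch the segment generate just created
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment> -noAdditions

With -noAdditions, updatedb only updates the status of urls already in the crawldb, so the next generate should not pick up any newly parsed outlinks.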
----- Original Message -----
From: "Somnath Banerjee" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, April 30, 2007 11:12 PM
Subject: Crawling fixed set of urls (newbie question)

> Hi,
>
> I thought I had a very simple requirement: I just want to crawl a fixed
> set of 2.3M urls. Following the tutorial, I injected the urls into the
> crawldb, generated a fetch list, and started fetching. After 5 days I
> found it had fetched 3M pages and fetching was still going on. I stopped
> the process, and now, looking at past posts in this group, I realize
> that I lost 5 days of crawl.
>
> Why did it fetch more pages than there were in the fetch list? Is it
> because I left the value of "db.max.outlinks.per.page" at 100? Also, in
> the crawl command I didn't specify the "depth" parameter. Can somebody
> please help me understand the process? In case this has already been
> discussed, please point me to the appropriate post if possible.
>
> From this mailing list, what I gathered is that I should generate small
> fetch lists and merge the fetched contents. Since my url set is fixed, I
> don't want Nutch to discover new urls. My understanding is that
> "./bin/nutch updatedb" will discover new urls, and the next time I run
> "./bin/nutch generate" it will add those discovered urls to the fetch
> list. Given that I just want to crawl my fixed list of urls, what is the
> best way to do that?
>
> Thanks in advance,
> -Som
> PS: I'm using nutch-0.9 in case that is required
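PS: if you would rather not pass the flag on every round, I believe -noAdditions is backed by the "db.update.additions.allowed" property, so you can make it the default by overriding it in conf/nutch-site.xml (check nutch-default.xml in your release to confirm the property exists there):

  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
    <description>Do not add newly discovered outlinks to the crawldb;
    only the injected urls will ever be generated and fetched.</description>
  </property>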
