Hi, I thought I had a very simple requirement: I just want to crawl a fixed set of 2.3M URLs. Following the tutorial, I injected the URLs into the crawldb, generated a fetch list, and started fetching. After 5 days I found it had fetched 3M pages and was still going. I stopped the process, and after looking through past posts in this group I realize I have lost 5 days of crawling.
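
For reference, what I ran was roughly the following (paths and the segment name are just examples, reconstructed from memory, so the exact arguments may be off):

  bin/nutch inject crawl/crawldb urls                 # urls/ holds my 2.3M seed list
  bin/nutch generate crawl/crawldb crawl/segments     # produce a fetch list
  bin/nutch fetch crawl/segments/20080101000000       # fetch the segment generate just created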
Why did it fetch more pages than were in the fetch list? Is it because I left the value of "db.max.outlinks.per.page" at 100? Also, in the crawl command I didn't specify the "depth" parameter. Can somebody please help me understand the process? If this has already been discussed, please point me to the appropriate post.

From this mailing list I gathered that I should generate small fetch lists and merge the fetched content afterwards. Since my URL set is fixed, I don't want Nutch to discover new URLs. My understanding is that "./bin/nutch updatedb" will discover new URLs, and the next time I run "./bin/nutch generate" it will add those discovered URLs to the fetch list. Given that I just want to crawl my fixed list of URLs, what is the best way to do that?

Thanks in advance,
-Som

PS: I'm using nutch-0.9, in case that matters.
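
PS2: To make the question concrete, this is the cycle I believe introduces the newly discovered URLs (my own sketch of the generate/fetch/updatedb loop; the segment path is again just an example):

  # updatedb merges the fetched segment's outlinks into the crawldb
  bin/nutch updatedb crawl/crawldb crawl/segments/20080101000000
  # the next generate then puts those newly discovered URLs into the fetch list
  bin/nutch generate crawl/crawldb crawl/segments

This is the step I would like to avoid (or configure) so that only my original 2.3M URLs ever get fetched.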
