Thanks - perhaps I misunderstand the depth and topN options. My understanding of the depth option is that Nutch will only go X levels deep in the URLs to find websites. If I change that depth later, does that mean it will go deeper at a later point in time? I thought it would continue ignoring URLs beyond that depth even once it was told a higher depth. In other words, if I run a crawl with a depth of 2, then a week later run one with a depth of 4, and then perhaps a couple of weeks later run one with a depth of 6, will that work?
Finally, the topN option - does that mean it selects only the 1000 "best" URLs for this *particular* crawl, and then in the *next* crawl picks another 1000? I guess on both of these options I was under the impression that large chunks of websites would never get crawled no matter how many times I went back to crawl them.

Thanks very much for the clarification...

Paul

-----Original Message-----
From: Susam Pal [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 05, 2008 10:36 PM
To: nutch-user@lucene.apache.org
Subject: Re: Limiting Crawl Time

Did you try specifying a topN value? -depth 3 -topN 1000 should be close to what you want.

On 2/6/08, Paul Stewart <[EMAIL PROTECTED]> wrote:
> Hi folks...
>
> What is the best way to say limit crawling to perhaps 3-4 hours per day?
> Is there a way to do this?
>
> Right now, I have a crawl depth of 6 and maximum per site of 100. I
> thought this would limit things pretty low but during some test crawls,
> my last crawl took 2.5 days to complete:
>
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls: 1566612
> retry 0: 1549310
> retry 1: 12814
> retry 2: 1601
> retry 3: 2887
> min score: 0.0
> avg score: 0.037
> max score: 429.15
> status 1 (db_unfetched): 1021400
> status 2 (db_fetched): 446907
> status 3 (db_gone): 74420
> status 4 (db_redir_temp): 13861
> status 5 (db_redir_perm): 10024
> CrawlDb statistics: done
>
> What I would like to do is crawl for 3-4 hours per day at most to
> gradually fill the index.... thoughts?
>
> Thanks very much,
>
> Paul
>
> ----------------------------------------------------------------------------
>
> "The information transmitted is intended only for the person or entity to
> which it is addressed and contains confidential and/or privileged material.
> If you received this in error, please contact the sender immediately and
> then destroy this transmission, including all attachments, without copying,
> distributing or disclosing same. Thank you."
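To make the depth/topN question concrete, here is a toy model in Python of how the two options are generally understood to interact: depth as the number of generate/fetch/update rounds in one run, and topN as a cap on the URLs selected per round. This is only a sketch under those assumptions, not Nutch's code. The names `crawl_round` and `crawl`, the dict-based crawldb, the link graph, and the score-halving rule are all hypothetical.

```python
import heapq


def crawl_round(crawldb, link_graph, top_n):
    """Run one toy generate/fetch/update round and return the fetched URLs.

    crawldb maps url -> {"score": float, "fetched": bool}.
    link_graph maps url -> list of outlinks (made-up data, not a real site).
    """
    unfetched = [url for url, entry in crawldb.items() if not entry["fetched"]]
    # "generate": select at most top_n unfetched URLs, highest score first
    batch = heapq.nlargest(top_n, unfetched, key=lambda u: crawldb[u]["score"])
    for url in batch:
        # "fetch": mark the page as done
        crawldb[url]["fetched"] = True
        # "update": newly seen outlinks enter the db as unfetched;
        # halving the parent's score is an arbitrary stand-in for real scoring
        for out in link_graph.get(url, []):
            crawldb.setdefault(
                out, {"score": crawldb[url]["score"] * 0.5, "fetched": False}
            )
    return batch


def crawl(crawldb, link_graph, depth, top_n):
    """Run up to `depth` rounds, stopping early when nothing is left to fetch."""
    fetched = []
    for _ in range(depth):
        batch = crawl_round(crawldb, link_graph, top_n)
        if not batch:
            break
        fetched.extend(batch)
    return fetched
```

In this model a single run fetches at most depth x topN pages, and URLs that were not selected stay in the crawldb as unfetched, so a later run over the same crawldb picks them up rather than skipping them forever.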