Thanks - perhaps I misunderstand the depth and topN commands.

My understanding of the depth command is that Nutch will only go X deep
in the URLs to find websites - if I change that depth later, does that
mean it will go deeper at a later point in time?  I thought it would keep
ignoring URLs beyond the original depth even after it was given a higher
one.  In other words, if I run a crawl with a depth of 2, then a week
later run with a depth of 4, and a couple of weeks after that run with a
depth of 6, will that work?

Finally, the topN command - does that mean it selects only the 1000
"best" URLs for this *particular* crawl, but in the *next* crawl picks
another 1000 to match?

I guess on both of these commands I was under the impression that large
chunks of websites would never get crawled, no matter how many times I
went back to crawl them?

Thanks very much for the clarification...

Paul


-----Original Message-----
From: Susam Pal [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 05, 2008 10:36 PM
To: nutch-user@lucene.apache.org
Subject: Re: Limiting Crawl Time

Did you try specifying a topN value? -depth 3 -topN 1000 should be
close to what you want.
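
As a rough illustration of how the two options interact, here is a toy
Python model of a generate/fetch/update loop. It is not Nutch code: the
function name, the link graph and the scores are all made up for the
example. It assumes that depth is the number of rounds run in one
invocation, that topN caps how many unfetched URLs are picked per round
(highest score first), and that the URL database persists between runs.
Under those assumptions, a later run simply continues with whatever is
still unfetched.

```python
# Toy model of a generate/fetch/update loop; NOT Nutch code.
# crawl_rounds, link_graph and scores are hypothetical names for illustration.

def crawl_rounds(crawldb, link_graph, scores, depth, top_n):
    """Run `depth` rounds; each round fetches at most `top_n` unfetched URLs.

    crawldb maps url -> "unfetched" or "fetched" and is updated in place,
    so a later call continues where the previous one stopped.
    """
    for _ in range(depth):
        # "generate": pick the best-scoring URLs that are still unfetched
        candidates = [u for u, s in crawldb.items() if s == "unfetched"]
        candidates.sort(key=lambda u: scores.get(u, 0.0), reverse=True)
        batch = candidates[:top_n]
        if not batch:
            break  # nothing left to fetch
        # "fetch" + "update": mark as fetched, record newly seen outlinks
        for url in batch:
            crawldb[url] = "fetched"
            for out in link_graph.get(url, []):
                crawldb.setdefault(out, "unfetched")
    return crawldb


# Made-up site structure and scores.
link_graph = {
    "seed": ["a", "b", "c"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "c": [],
}
scores = {"seed": 1.0, "a": 0.9, "b": 0.8, "c": 0.1,
          "a1": 0.5, "a2": 0.4, "b1": 0.3}

crawldb = {"seed": "unfetched"}

# First run: 2 rounds, at most 2 URLs per round.
crawl_rounds(crawldb, link_graph, scores, depth=2, top_n=2)
first_run = sorted(u for u, s in crawldb.items() if s == "fetched")

# A later run against the same crawldb picks up the URLs left over.
crawl_rounds(crawldb, link_graph, scores, depth=2, top_n=2)
second_run = sorted(u for u, s in crawldb.items() if s == "fetched")

print(first_run)
print(second_run)
```

In this toy model the low-scoring URL "c" is skipped in the first run but
fetched in the second, once the better-scoring candidates are used up.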

On 2/6/08, Paul Stewart <[EMAIL PROTECTED]> wrote:
> Hi folks...
>
> What is the best way to say limit crawling to perhaps 3-4 hours per
> day?  Is there a way to do this?
>
> Right now, I have a crawl depth of 6 and maximum per site of 100.  I
> thought this would limit things pretty low but during some test
> crawls, my last crawl took 2.5 days to complete:
>
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:     1566612
> retry 0:        1549310
> retry 1:        12814
> retry 2:        1601
> retry 3:        2887
> min score:      0.0
> avg score:      0.037
> max score:      429.15
> status 1 (db_unfetched):        1021400
> status 2 (db_fetched):  446907
> status 3 (db_gone):     74420
> status 4 (db_redir_temp):       13861
> status 5 (db_redir_perm):       10024
> CrawlDb statistics: done
>
>
> What I would like to do is crawl for 3-4 hours per day at most to
> gradually fill the index.... thoughts?
>
> Thanks very much,
>
> Paul
>
> ----------------------------------------------------------------------------
>
> "The information transmitted is intended only for the person or entity
> to which it is addressed and contains confidential and/or privileged
> material. If you received this in error, please contact the sender
> immediately and then destroy this transmission, including all
> attachments, without copying, distributing or disclosing same. Thank
> you."
>

--
Sent from Gmail for mobile | mobile.google.com



