I am not sure, but I think this is the reason:

When you crawl the site for the first time with a specified depth, the other
URLs are discovered along the way.

But the second time you crawl, those URLs are already in the crawl db, and the
depth is counted relative to these already-known URLs.

In both cases the depth setting is the same, but since depth measures how far
the crawl goes from each known URL, the second crawl reaches further and
fetches more, because many URLs are already there when it starts.
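
To make the idea concrete, here is a rough toy model in Python. It is not
Nutch code: the site graph, the function name crawl and the
refetch_each_round flag are all made up for illustration. It also assumes
that forcing a refetch makes every known URL count as due in every round,
which I have not verified against the actual generator.

```python
def crawl(links, known, depth, refetch_each_round=False):
    """Toy model of `depth` generate+fetch rounds (not Nutch's real code).

    links: url -> list of outlinks, a made-up site graph
    known: urls already in the crawl db when the run starts
    refetch_each_round: if True, every known url counts as due in every
        round (an assumed effect of forcing a refetch); if False, a url
        is fetched at most once per run
    Returns (total number of fetches, set of urls known at the end).
    """
    known = set(known)
    fetched = set()
    total = 0
    for _ in range(depth):
        # Pick the urls that are due in this round.
        due = set(known) if refetch_each_round else known - fetched
        total += len(due)
        fetched |= due
        # Outlinks of fetched pages become known for the next round.
        for url in due:
            known.update(links.get(url, []))
    return total, known


# Hypothetical site: a chain of 8 pages, each linking to the next one.
site = {i: [i + 1] for i in range(7)}
site[7] = []

# First crawl starts from a single seed url.
first_total, db = crawl(site, known={0}, depth=5)

# Recrawl starts from everything the first crawl discovered.
again_total, db2 = crawl(site, known=db, depth=5, refetch_each_round=True)

print(first_total, len(db))    # 5 fetches, 6 urls known
print(again_total, len(db2))   # 37 fetches, 8 urls known
```

With the same depth of 5, the first run fetches 5 pages, while the second
run starts from 6 known URLs, reaches the whole site, and fetches most pages
several times.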


Regards,
Prabhu



On 9/5/06, Andrei Hajdukewycz <[EMAIL PROTECTED]> wrote:

Hi,
I've crawled a site of roughly 30,000-40,000 pages using the
bin/nutch crawl command, which went quite smoothly. Now,
however, I'm trying to recrawl it using the script at
http://wiki.apache.org/nutch/IntranetRecrawl?action=show .

However, when I run the recrawl, generally I end up fetching
80-100k pages instead of 30-40k, with many pages fetched more
than once.

I assume this is due to the number of generate+fetch cycles I'm
running, which is 5. I'm looking for advice on settings to optimize
this so I end up with fewer duplicate fetches but still get proper
coverage of the site.

"depth" as per the script is set to 5, topN unspecified, 31
days added to force refetch of everything.

My relevant settings in nutch-site.xml are as follows:
db.ignore.internal.links = false,
db.ignore.external.links = true,
fetcher.server.delay = 1.0,
fetcher.threads.fetch = 3,
fetcher.threads.per.host = 3,
db.default.fetch.interval = 1

Any help would be most appreciated!
Andrei

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
