Here is the result after a recrawl:

CrawlDb statistics start: crawl_fetcher/crawldb
Statistics for CrawlDb: crawl_fetcher/crawldb
TOTAL urls:     3266
retry 0:        3266
min score:      0.19
avg score:      1.0285031
max score:      10.229
status 1 (DB_unfetched):        3204
status 2 (DB_fetched):  62
CrawlDb statistics: done

I don't understand why so many URLs are unfetched?
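
A quick way to see which URLs stayed unfetched is to dump the crawldb. This is only a sketch, assuming the standard readdb tool and the crawl_fetcher/crawldb path from the stats above; the crawldb_dump directory name is just an example, and the exact dump format can vary between Nutch versions:

bin/nutch readdb crawl_fetcher/crawldb -stats              # the summary shown above
bin/nutch readdb crawl_fetcher/crawldb -dump crawldb_dump  # one text record per URL
grep -B 1 "DB_unfetched" crawldb_dump/part-*               # URLs that were never fetched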

Regards.

Jo
________________________________________
From: José Mestre [EMAIL PROTECTED]
Sent: Monday, 1 December 2008 19:01
To: [email protected]
Subject: RE: Problem with crawl and recrawl

Hi,

I use the script, and I've already tried running it line by line.
Yes, after the fetch I do an updatedb, and then I fetch again, ... as many
fetches as the depth value.
I've also tried updatedb with the -noAdditions option, as mentioned in the script
description, but with no success.
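
For reference, these are the two updatedb variants in question; just a sketch, with paths as in the wiki scripts, $segment standing for the segment that was just fetched, and assuming your Nutch version supports the -noAdditions flag:

bin/nutch updatedb crawl/crawldb $segment                # newly discovered outlinks are added as DB_unfetched
bin/nutch updatedb crawl/crawldb $segment -noAdditions   # only URLs already in the crawldb are updated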

Regards.

Jo

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Monday, 1 December 2008 18:48
To: [email protected]
Subject: Re: Problem with crawl and recrawl

When you do the generate and fetch commands, are you also running an
updatedb command and then multiple generate and fetch cycles?  The
-depth 3 parameter automates this on the crawl command.
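
Spelled out, crawl with -depth 3 is roughly equivalent to the following sequence of individual commands. This is only a sketch: paths and option order follow the wiki Crawl script and may differ slightly between Nutch versions.

bin/nutch inject crawl/crawldb urls                   # seed the crawldb once

for i in 1 2 3; do                                    # one pass per depth level
  bin/nutch generate crawl/crawldb crawl/segments     # select URLs due for fetching
  segment=`ls -d crawl/segments/* | tail -1`          # the segment just created
  bin/nutch fetch $segment -threads 200               # fetch (and parse) it
  bin/nutch updatedb crawl/crawldb $segment           # feed new outlinks back into the crawldb
done

The crawl command also runs invertlinks and the indexer at the end, but the difference in document counts should already be visible in the crawldb after the updatedb steps.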

Dennis

José Mestre wrote:
> Hi,
>
> I'm using nutch to index part of an intranet website.
>
> When I use the "crawl" command, I end up with 3000 documents in the database:
> e.g.: nutch crawl urls -dir crawl -threads 200 -depth 3
> But when I do the same with the separate "generate, fetch, ..." commands, I
> only get 50 documents in the database:
> e.g.: the crawl or recrawl script with adddays=31
> http://wiki.apache.org/nutch/Crawl
> http://wiki.apache.org/nutch/IntranetRecrawl
> I've tried using updatedb with the -noAdditions option
>
> Does anyone know why this happens?
>
> I think 'crawl-urlfilter.txt' and 'regex-urlfilter.txt' are OK.
>
> Regards.
>
> Jo
>
>
