RE: RE : Problem with crawl and recrawl

José Mestre Mon, 08 Dec 2008 08:48:55 -0800

Hi again, I have not received an answer.
Why are my documents unfetched when I do a recrawl?


Thanks.

José Mestre

-----Original Message-----
From: José Mestre [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 2 December 2008 14:07
To: [email protected]
Subject: RE : RE : Problem with crawl and recrawl

Hi,

62 docs are in the index.

José
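
Note: the 62 documents reported here match the DB_fetched count in the CrawlDb statistics quoted further down in this thread. As a rough illustration only, the Python sketch below parses those status lines and reports how small the fetched share is. The function names and the STATS_TEXT constant are made up for this example; they are not part of Nutch.

```python
import re

# Status lines copied from the CrawlDb statistics quoted in this thread.
STATS_TEXT = """\
TOTAL urls:     3266
status 1 (DB_unfetched):        3204
status 2 (DB_fetched):  62
"""


def parse_status_counts(text):
    """Return {status_name: count} for lines like 'status 2 (DB_fetched):  62'."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*status \d+ \((\w+)\):\s*(\d+)", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def parse_total(text):
    """Return the integer after 'TOTAL urls:', or None if the line is missing."""
    match = re.search(r"TOTAL urls:\s*(\d+)", text)
    return int(match.group(1)) if match else None


counts = parse_status_counts(STATS_TEXT)
total = parse_total(STATS_TEXT)
fetched = counts.get("DB_fetched", 0)
print(f"{fetched} of {total} URLs fetched ({100.0 * fetched / total:.1f}%)")
```

With the numbers from the thread, the two status counts add up to the total (3204 + 62 = 3266), so every URL the CrawlDb knows about is either fetched or still waiting to be fetched.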

________________________________________
From: Alexander Aristov [EMAIL PROTECTED]
Sent: Tuesday, 2 December 2008 06:58
To: [email protected]
Subject: Re: RE : Problem with crawl and recrawl

Maybe a silly question, but how do you know how many docs are in the index?

thanks
Alex

2008/12/2 José Mestre <[EMAIL PROTECTED]>

> Here is the result with a recrawl:
>
> CrawlDb statistics start: crawl_fetcher/crawldb
> Statistics for CrawlDb: crawl_fetcher/crawldb
> TOTAL urls:     3266
> retry 0:        3266
> min score:      0.19
> avg score:      1.0285031
> max score:      10.229
> status 1 (DB_unfetched):        3204
> status 2 (DB_fetched):  62
> CrawlDb statistics: done
>
> I don't understand why the URLs are unfetched.
>
> Regards.
>
> Jo
> ________________________________________
> From: José Mestre [EMAIL PROTECTED]
> Sent: Monday, 1 December 2008 19:01
> To: [email protected]
> Subject: RE: Problem with crawl and recrawl
>
> Hi,
>
> I use the script, and I have already tried running it line by line.
> Yes, after each fetch I do an updatedb and then fetch again, for as
> many fetches as the depth value.
> I've tried using updatedb with the -noAdditions option, as mentioned in
> a script description, but with no success.
>
> Regards.
>
> Jo
>
> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> Sent: Monday, 1 December 2008 18:48
> To: [email protected]
> Subject: Re: Problem with crawl and recrawl
>
> When you run the generate and fetch commands, are you also running an
> updatedb command and then multiple generate and fetch cycles?  The
> -depth 3 parameter automates this on the crawl command.
>
> Dennis
>
> José Mestre wrote:
> > Hi,
> >
> > I'm using nutch to index part of an intranet website.
> >
> > When I use the "crawl" command the database indexes 3000 documents:
> > e.g.: nutch crawl urls -dir crawl -threads 200 -depth 3
> > But when I do the same with the separate "generate, fetch, ..."
> > commands I just have 50 documents in the database:
> > e.g.: the crawl or recrawl script with adddays=31
> > http://wiki.apache.org/nutch/Crawl
> > http://wiki.apache.org/nutch/IntranetRecrawl
> > I've tried using fetch with the -noAdditions option.
> >
> > Does someone know why this happens?
> >
> > I think 'crawl-urlfilter.txt' and 'regex-urlfilter.txt' are OK.
> >
> > Regards.
> >
> > Jo
> >
> >
>



--
Best Regards
Alexander Aristov
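
Note: the cycle Dennis asks about in the quoted message (generate, fetch, updatedb, repeated once per depth level) can be laid out as in the Python sketch below. It only builds and prints the command lines and does not run Nutch. The helper names, the crawl/crawldb and crawl/segments paths, the thread count, and the SEGMENT_n placeholders are assumptions made for illustration; in a real run, generate creates the segment directory and its name has to be looked up afterwards. My understanding is that -noAdditions tells updatedb not to add newly discovered URLs, but that should be checked against the usage text of your Nutch version.

```python
def crawl_round_commands(crawldb, segments_dir, segment, threads=10,
                         no_additions=False):
    """Build the three command lines for ONE generate/fetch/updatedb round.

    `segment` is the directory that generate creates under `segments_dir`;
    its real name is only known after generate has run, so the caller
    passes it in (here it is just a placeholder string).
    """
    updatedb = ["bin/nutch", "updatedb", crawldb, segment]
    if no_additions:
        # Option named in the thread; semantics assumed, see the note above.
        updatedb.append("-noAdditions")
    return [
        ["bin/nutch", "generate", crawldb, segments_dir],
        ["bin/nutch", "fetch", segment, "-threads", str(threads)],
        updatedb,
    ]


def crawl_plan(crawldb, segments_dir, depth, threads=10):
    """Repeat the round `depth` times, as the crawl command's -depth does."""
    plan = []
    for round_no in range(1, depth + 1):
        # Hypothetical placeholder for the segment created in this round.
        segment = f"{segments_dir}/SEGMENT_{round_no}"
        plan.extend(crawl_round_commands(crawldb, segments_dir, segment, threads))
    return plan


for argv in crawl_plan("crawl/crawldb", "crawl/segments", depth=3):
    print(" ".join(argv))
```

For depth 3 this prints nine command lines: three rounds of generate, fetch, and updatedb. Skipping the updatedb step between rounds would leave the next generate without any record of what the previous fetch did, which is why Dennis asks about it.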

