RE : RE : Problem with crawl and recrawl

José Mestre Tue, 02 Dec 2008 05:08:44 -0800

Hi,

62 docs are in the index.


José

________________________________________
From: Alexander Aristov [EMAIL PROTECTED]
Sent: Tuesday, 2 December 2008 06:58
To: [email protected]
Subject: Re: RE: Problem with crawl and recrawl

Maybe a silly question, but how do I find out how many docs are in the index?

thanks
Alex

2008/12/2 José Mestre <[EMAIL PROTECTED]>

> Here is the result with a recrawl:
>
> CrawlDb statistics start: crawl_fetcher/crawldb
> Statistics for CrawlDb: crawl_fetcher/crawldb
> TOTAL urls:     3266
> retry 0:        3266
> min score:      0.19
> avg score:      1.0285031
> max score:      10.229
> status 1 (DB_unfetched):        3204
> status 2 (DB_fetched):  62
> CrawlDb statistics: done
>
> I don't understand why the URLs are still unfetched.
>
> Regards.
>
> Jo
> ________________________________________
> From: José Mestre [EMAIL PROTECTED]
> Sent: Monday, 1 December 2008 19:01
> To: [email protected]
> Subject: RE: Problem with crawl and recrawl
>
> Hi,
>
> I use the script, and I've already tried running it line by line.
> Yes, after the fetch I do an updatedb, and after that I do a fetch again,
> and so on, with as many fetches as the depth value.
> I've tried using updatedb with the -noAdditions option, as mentioned in a
> script description, but without success.
>
> Regards.
>
> Jo
>
> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> Sent: Monday, 1 December 2008 18:48
> To: [email protected]
> Subject: Re: Problem with crawl and recrawl
>
> When you run the generate and fetch commands, are you also running an
> updatedb command and then multiple generate and fetch cycles?  The -depth 3
> parameter automates this for the crawl command.
>
> Dennis
>
> José Mestre wrote:
> > Hi,
> >
> > I'm using nutch to index part of an intranet website.
> >
> > When I use the "crawl" command, the database indexes 3000 documents:
> > e.g.: nutch crawl urls -dir crawl -threads 200 -depth 3
> > But when I do the same with the separate "generate, fetch, ..." commands,
> > I end up with just 50 documents in the database:
> > e.g.: the crawl or recrawl script with adddays=31
> > http://wiki.apache.org/nutch/Crawl
> > http://wiki.apache.org/nutch/IntranetRecrawl
> > I've tried using fetch with the -noAdditions option.
> >
> > Does anyone know why this happens?
> >
> > I think 'crawl-urlfilter.txt' and 'regex-urlfilter.txt' are OK.
> >
> > Regards.
> >
> > Jo
> >
> >
>



--
Best Regards
Alexander Aristov
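
For reference, the cycle Dennis describes in the quoted text (generate, fetch, updatedb, repeated once per depth level) could be sketched roughly as below. This is only a minimal sketch under assumptions, not the wiki script José used: the variables NUTCH, CRAWL_DIR and DEPTH and the function name run_cycles are illustrative, the seed directory "urls" and the "crawl" directory are taken from the crawl command in the thread, and the sketch assumes the fetcher parses pages as it fetches. The updatedb step is run without -noAdditions, so links discovered during a fetch are added to the crawldb and can be selected by the next generate.

```shell
#!/bin/sh
# Sketch of a step-by-step stand-in for: nutch crawl urls -dir crawl -depth 3
# NUTCH, CRAWL_DIR and DEPTH are illustrative shell variables, not Nutch settings.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
DEPTH="${DEPTH:-3}"

run_cycles() {
    # Seed the crawldb once from the "urls" directory.
    $NUTCH inject "$CRAWL_DIR/crawldb" urls || return 1

    i=1
    while [ "$i" -le "$DEPTH" ]; do
        # generate: select URLs that are due for fetching into a new segment.
        $NUTCH generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || return 1

        # Segment names sort in creation order, so the last one listed is
        # assumed to be the segment that was just generated.
        segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -n 1)

        # fetch: download the pages listed in that segment.
        $NUTCH fetch "$segment" || return 1

        # updatedb: write fetch results and newly discovered links back to
        # the crawldb (no -noAdditions here, so new URLs are added).
        $NUTCH updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1

        i=$((i + 1))
    done

    # Print the fetched/unfetched counts, as in the statistics quoted above.
    $NUTCH readdb "$CRAWL_DIR/crawldb" -stats
}
```

Calling run_cycles from the Nutch install directory runs DEPTH rounds and then prints the CrawlDb statistics, which makes it easy to see whether the DB_unfetched count shrinks from one round to the next.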
