Hi,

> Are you getting anything special in the log file?

No, nothing special.
> do you call *generate* to get a new segment? If so, do you then
> call *fetch* on that second segment?

Yes, I do that. Here is my script:

echo "Inject"
/opt/nutch-0.8.1/bin/nutch inject crawl_fetcher/crawldb urlsfetch

echo "#Fetch1#"
/opt/nutch-0.8.1/bin/nutch generate crawl_fetcher/crawldb crawl_fetcher/segments -adddays 31
s1=`ls -d crawl_fetcher/segments/2* | tail -1`
echo $s1
/opt/nutch-0.8.1/bin/nutch fetch $s1 -threads 500
/opt/nutch-0.8.1/bin/nutch updatedb crawl_fetcher/crawldb $s1 -noAdditions

echo "#Fetch2#"
/opt/nutch-0.8.1/bin/nutch generate crawl_fetcher/crawldb crawl_fetcher/segments
s2=`ls -d crawl_fetcher/segments/2* | tail -1`
echo $s2
/opt/nutch-0.8.1/bin/nutch fetch $s2 -threads 500
/opt/nutch-0.8.1/bin/nutch updatedb crawl_fetcher/crawldb $s2 -noAdditions

echo "#Fetch3#"
/opt/nutch-0.8.1/bin/nutch generate crawl_fetcher/crawldb crawl_fetcher/segments
s3=`ls -d crawl_fetcher/segments/2* | tail -1`
echo $s3
/opt/nutch-0.8.1/bin/nutch fetch $s3 -threads 500
/opt/nutch-0.8.1/bin/nutch updatedb crawl_fetcher/crawldb $s3 -noAdditions

# PREPARING CRAWL_DB
echo "Invert Links"
/opt/nutch-0.8.1/bin/nutch invertlinks crawl_fetcher/linkdb -dir crawl_fetcher/segments

echo "Index Base"
/opt/nutch-0.8.1/bin/nutch index crawl_fetcher/indexes crawl_fetcher/crawldb crawl_fetcher/linkdb crawl_fetcher/segments/*

echo "# Tell Tomcat to reload index"
/home/spibio/tomcat restart

echo Stats
/opt/nutch-0.8.1/bin/nutch readdb crawl_fetcher/crawldb -stats

José Mestre
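The three Fetch blocks in the script above repeat one and the same
generate -> fetch -> updatedb cycle, which is what the crawl command's
-depth parameter automates. A minimal sketch of the same cycle written as
a loop, reusing the paths, thread count, and flags from the script above;
the DEPTH variable and the loop itself are illustrative, not part of the
original script:

    #!/bin/sh
    # Sketch: the generate -> fetch -> updatedb cycle folded into a loop.
    # DEPTH is an illustrative variable; the original script runs 3 cycles.
    NUTCH=/opt/nutch-0.8.1/bin/nutch
    DB=crawl_fetcher/crawldb
    SEGS=crawl_fetcher/segments
    DEPTH=3

    i=1
    while [ $i -le $DEPTH ]; do
      echo "#Fetch$i#"
      if [ $i -eq 1 ]; then
        # First cycle only: treat URLs as due if their fetch time falls
        # within the next 31 days, as in the original script
        $NUTCH generate $DB $SEGS -adddays 31
      else
        $NUTCH generate $DB $SEGS
      fi
      # Pick the newest segment directory (segment names are timestamps,
      # e.g. 2008...)
      seg=`ls -d $SEGS/2* | tail -1`
      echo $seg
      $NUTCH fetch $seg -threads 500
      $NUTCH updatedb $DB $seg -noAdditions
      i=`expr $i + 1`
    done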
-----Original Message-----
From: Julien Nioche [mailto:[EMAIL PROTECTED]]
Sent: Monday, 8 December 2008 18:22
To: [email protected]
Subject: Re: RE : Problem with crawl and recrawl

Bonjour Jose,

Sorry if I am suggesting something obvious, but after you've done the
*updatedb*, do you call *generate* to get a new segment? If so, do you then
call *fetch* on that second segment? Are you getting anything special in the
log file?

Best,

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2008/12/8 José Mestre <[EMAIL PROTECTED]>

> Hi again, I have no answer.
> Why are my documents unfetched when I do a recrawl, please?
>
> Thanks.
>
> José Mestre
>
> -----Original Message-----
> From: José Mestre [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, 2 December 2008 14:07
> To: [email protected]
> Subject: RE : RE : Problem with crawl and recrawl
>
> Hi,
>
> 62 docs are in the index.
>
> José
>
> ________________________________________
> From: Alexander Aristov [EMAIL PROTECTED]
> Sent: Tuesday, 2 December 2008 06:58
> To: [email protected]
> Subject: Re: RE : Problem with crawl and recrawl
>
> Maybe a silly question, but how do you know how many docs are in the
> index?
>
> thanks
> Alex
>
> 2008/12/2 José Mestre <[EMAIL PROTECTED]>
>
> > Here is the result of a recrawl:
> >
> > CrawlDb statistics start: crawl_fetcher/crawldb
> > Statistics for CrawlDb: crawl_fetcher/crawldb
> > TOTAL urls:  3266
> > retry 0:     3266
> > min score:   0.19
> > avg score:   1.0285031
> > max score:   10.229
> > status 1 (DB_unfetched):  3204
> > status 2 (DB_fetched):    62
> > CrawlDb statistics: done
> >
> > I don't understand why these URLs are unfetched.
> >
> > Regards.
> >
> > Jo
> > ________________________________________
> > From: José Mestre [EMAIL PROTECTED]
> > Sent: Monday, 1 December 2008 19:01
> > To: [email protected]
> > Subject: RE: Problem with crawl and recrawl
> >
> > Hi,
> >
> > I use the script, and I've also tried running it line by line.
> > Yes, after the fetch I do an updatedb, and then I fetch again...
> > as many fetches as the depth value.
> > I've tried using updatedb with the -noAdditions option, as mentioned
> > in a script description, but with no success.
> >
> > Regards.
> >
> > Jo
> >
> > -----Original Message-----
> > From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, 1 December 2008 18:48
> > To: [email protected]
> > Subject: Re: Problem with crawl and recrawl
> >
> > When you do the generate and fetch commands, are you also doing an
> > updatedb command, and then multiple generate and fetch cycles? The
> > -depth 3 parameter automates this in the crawl command.
> >
> > Dennis
> >
> > José Mestre wrote:
> > > Hi,
> > >
> > > I'm using Nutch to index part of an intranet website.
> > >
> > > When I use the "crawl" command, the database indexes 3000 documents:
> > > e.g. nutch crawl urls -dir crawl -threads 200 -depth 3
> > > But when I do the same with the separate "generate, fetch, ..."
> > > commands, I only get 50 documents in the database:
> > > e.g. the crawl or recrawl script with adddays=31
> > > http://wiki.apache.org/nutch/Crawl
> > > http://wiki.apache.org/nutch/IntranetRecrawl
> > > I've tried using fetch with the -noAdditions option.
> > >
> > > Does someone know why this happens?
> > >
> > > I think 'crawl-urlfilter.txt' and 'regex-urlfilter.txt' are OK.
> > >
> > > Regards.
> > >
> > > Jo
>
> --
> Best Regards
> Alexander Aristov
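For a DB_unfetched count like the one above, the crawldb can be inspected
beyond -stats. A hedged sketch, assuming the -dump and -url options that
Nutch 0.8.x's readdb command provides; the dump directory name and the
example URL are made up for illustration:

    # Dump the whole crawldb as text; readdb writes part-* files under
    # the given output directory.
    /opt/nutch-0.8.1/bin/nutch readdb crawl_fetcher/crawldb -dump crawldb_dump

    # Skim the per-URL records for unfetched entries; the URL appears a
    # couple of lines above its status in the dump, so adjust -B as needed,
    # and the exact labels can vary between Nutch versions.
    grep -B 2 "DB_unfetched" crawldb_dump/part-00000 | head -40

    # Look up a single URL's record directly (hypothetical URL):
    /opt/nutch-0.8.1/bin/nutch readdb crawl_fetcher/crawldb -url http://intranet.example.com/page.html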
