Hi,

> Are you getting anything special in the log file?

No, nothing special.
> do you call *generate* to get a new segment? If so, do you then
> call *fetch* on that second segment?

Yes, I do that. Here is my script:

echo "Inject"
/opt/nutch-0.8.1/bin/nutch inject crawl_fetcher/crawldb urlsfetch

echo "#Fetch1#"
/opt/nutch-0.8.1/bin/nutch generate crawl_fetcher/crawldb crawl_fetcher/segments -adddays 31
s1=`ls -d crawl_fetcher/segments/2* | tail -1`
echo $s1
/opt/nutch-0.8.1/bin/nutch fetch $s1 -threads 500
/opt/nutch-0.8.1/bin/nutch updatedb crawl_fetcher/crawldb $s1 -noAdditions

echo "#Fetch2#"
/opt/nutch-0.8.1/bin/nutch generate crawl_fetcher/crawldb crawl_fetcher/segments
s2=`ls -d crawl_fetcher/segments/2* | tail -1`
echo $s2
/opt/nutch-0.8.1/bin/nutch fetch $s2 -threads 500
/opt/nutch-0.8.1/bin/nutch updatedb crawl_fetcher/crawldb $s2 -noAdditions

echo "#Fetch3#"
/opt/nutch-0.8.1/bin/nutch generate crawl_fetcher/crawldb crawl_fetcher/segments
s3=`ls -d crawl_fetcher/segments/2* | tail -1`
echo $s3
/opt/nutch-0.8.1/bin/nutch fetch $s3 -threads 500
/opt/nutch-0.8.1/bin/nutch updatedb crawl_fetcher/crawldb $s3 -noAdditions

# PREPARING CRAWL_DB
echo "Invert Links"
/opt/nutch-0.8.1/bin/nutch invertlinks crawl_fetcher/linkdb -dir crawl_fetcher/segments

echo "Index Base"
/opt/nutch-0.8.1/bin/nutch index crawl_fetcher/indexes crawl_fetcher/crawldb crawl_fetcher/linkdb crawl_fetcher/segments/*

echo "# Tell Tomcat to reload index"
/home/spibio/tomcat restart

echo Stats
/opt/nutch-0.8.1/bin/nutch readdb crawl_fetcher/crawldb -stats

José Mestre
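The three Fetch blocks in the script above repeat one and the same
generate -> fetch -> updatedb cycle, which is what the crawl command's
-depth parameter automates. A minimal sketch of the same cycle written as
a loop, reusing the paths, thread count, and flags from the script above;
the DEPTH variable and the loop itself are illustrative, not part of the
original script:

    #!/bin/sh
    # Sketch: the generate -> fetch -> updatedb cycle folded into a loop.
    # DEPTH is an illustrative variable; the original script runs 3 cycles.
    NUTCH=/opt/nutch-0.8.1/bin/nutch
    DB=crawl_fetcher/crawldb
    SEGS=crawl_fetcher/segments
    DEPTH=3

    i=1
    while [ $i -le $DEPTH ]; do
      echo "#Fetch$i#"
      if [ $i -eq 1 ]; then
        # First cycle only: treat URLs as due if their fetch time falls
        # within the next 31 days, as in the original script
        $NUTCH generate $DB $SEGS -adddays 31
      else
        $NUTCH generate $DB $SEGS
      fi
      # Pick the newest segment directory (segment names are timestamps,
      # e.g. 2008...)
      seg=`ls -d $SEGS/2* | tail -1`
      echo $seg
      $NUTCH fetch $seg -threads 500
      $NUTCH updatedb $DB $seg -noAdditions
      i=`expr $i + 1`
    done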
-----Original Message-----
From: Julien Nioche [mailto:[EMAIL PROTECTED]]
Sent: Monday, 8 December 2008 18:22
To: [email protected]
Subject: Re: RE : Problem with crawl and recrawl

Bonjour Jose,

Sorry if I am suggesting something obvious, but after you've done the
*updatedb*, do you call *generate* to get a new segment? If so, do you then
call *fetch* on that second segment? Are you getting anything special in the
log file?

Best,

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2008/12/8 José Mestre <[EMAIL PROTECTED]>

> Hi again, I have no answer.
> Why are my documents unfetched when I do a recrawl, please?
>
> Thanks.
>
> José Mestre
>
> -----Original Message-----
> From: José Mestre [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, 2 December 2008 14:07
> To: [email protected]
> Subject: RE : RE : Problem with crawl and recrawl
>
> Hi,
>
> 62 docs are in the index.
>
> José
>
> ________________________________________
> From: Alexander Aristov [EMAIL PROTECTED]
> Sent: Tuesday, 2 December 2008 06:58
> To: [email protected]
> Subject: Re: RE : Problem with crawl and recrawl
>
> Maybe a silly question, but how do you know how many docs are in the
> index?
>
> thanks
> Alex
>
> 2008/12/2 José Mestre <[EMAIL PROTECTED]>
>
> > Here is the result of a recrawl:
> >
> > CrawlDb statistics start: crawl_fetcher/crawldb
> > Statistics for CrawlDb: crawl_fetcher/crawldb
> > TOTAL urls:  3266
> > retry 0:     3266
> > min score:   0.19
> > avg score:   1.0285031
> > max score:   10.229
> > status 1 (DB_unfetched):  3204
> > status 2 (DB_fetched):    62
> > CrawlDb statistics: done
> >
> > I don't understand why these URLs are unfetched.
> >
> > Regards.
> >
> > Jo
> > ________________________________________
> > From: José Mestre [EMAIL PROTECTED]
> > Sent: Monday, 1 December 2008 19:01
> > To: [email protected]
> > Subject: RE: Problem with crawl and recrawl
> >
> > Hi,
> >
> > I use the script, and I've also tried running it line by line.
> > Yes, after the fetch I do an updatedb, and then I fetch again...
> > as many fetches as the depth value.
> > I've tried using updatedb with the -noAdditions option, as mentioned
> > in a script description, but with no success.
> >
> > Regards.
> >
> > Jo
> >
> > -----Original Message-----
> > From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, 1 December 2008 18:48
> > To: [email protected]
> > Subject: Re: Problem with crawl and recrawl
> >
> > When you do the generate and fetch commands, are you also doing an
> > updatedb command, and then multiple generate and fetch cycles? The
> > -depth 3 parameter automates this in the crawl command.
> >
> > Dennis
> >
> > José Mestre wrote:
> > > Hi,
> > >
> > > I'm using Nutch to index part of an intranet website.
> > >
> > > When I use the "crawl" command, the database indexes 3000 documents:
> > > e.g. nutch crawl urls -dir crawl -threads 200 -depth 3
> > > But when I do the same with the separate "generate, fetch, ..."
> > > commands, I only get 50 documents in the database:
> > > e.g. the crawl or recrawl script with adddays=31
> > > http://wiki.apache.org/nutch/Crawl
> > > http://wiki.apache.org/nutch/IntranetRecrawl
> > > I've tried using fetch with the -noAdditions option.
> > >
> > > Does someone know why this happens?
> > >
> > > I think 'crawl-urlfilter.txt' and 'regex-urlfilter.txt' are OK.
> > >
> > > Regards.
> > >
> > > Jo
>
> --
> Best Regards
> Alexander Aristov
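For a DB_unfetched count like the one above, the crawldb can be inspected
beyond -stats. A hedged sketch, assuming the -dump and -url options that
Nutch 0.8.x's readdb command provides; the dump directory name and the
example URL are made up for illustration:

    # Dump the whole crawldb as text; readdb writes part-* files under
    # the given output directory.
    /opt/nutch-0.8.1/bin/nutch readdb crawl_fetcher/crawldb -dump crawldb_dump

    # Skim the per-URL records for unfetched entries; the URL appears a
    # couple of lines above its status in the dump, so adjust -B as needed,
    # and the exact labels can vary between Nutch versions.
    grep -B 2 "DB_unfetched" crawldb_dump/part-00000 | head -40

    # Look up a single URL's record directly (hypothetical URL):
    /opt/nutch-0.8.1/bin/nutch readdb crawl_fetcher/crawldb -url http://intranet.example.com/page.html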
