Hi,

Check the fetch time in your crawldb. You can dump the whole crawldb like this:

./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db

Entries will look like this:

http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 0.0014977538
Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
Metadata: _pst_: success(1), lastModified=0

As you can see, the next time the page will be fetched is given by the fetch time ('Fetch time: Thu Dec 10 09:19:18 EST 2009'). Also check the retry interval: it should be your 3600.

Hope this helps.

> Subject: RE: how to force nutch to do a recrawl
> Date: Wed, 9 Dec 2009 16:06:58 -0500
> From: [email protected]
> To: [email protected]
>
> Okay. I'll dig a little deeper. I saw a few scripts that people had
> created, but I couldn't get them to work.
>
> Thanks much.
>
> Vijaya Peters
> SRA International, Inc.
> 4350 Fair Lakes Court North
> Room 4004
> Fairfax, VA 22033
> Tel: 703-502-1184
>
> www.sra.com
> Named to FORTUNE's "100 Best Companies to Work For" list for 10
> consecutive years
> Please consider the environment before printing this e-mail
> This electronic message transmission contains information from SRA
> International, Inc. which may be confidential, privileged or
> proprietary. The information is intended for the use of the individual
> or entity named above. If you are not the intended recipient, be aware
> that any disclosure, copying, distribution, or use of the contents of
> this information is strictly prohibited. If you have received this
> electronic information in error, please notify us immediately by
> telephone at 866-584-2143.
>
> -----Original Message-----
> From: MilleBii [mailto:[email protected]]
> Sent: Wednesday, December 09, 2009 4:05 PM
> To: [email protected]
> Subject: Re: how to force nutch to do a recrawl
>
> I don't think that you can use the nutch crawl command to do that; it is a
> one-stop-shop command.
> You probably want to use the individual commands.
> Type nutch generate to get the help and you will see the -adddays option.
> Read this page on the wiki to get a feel for how you should do it:
> http://wiki.apache.org/nutch/Crawl
>
> 2009/12/9 Peters, Vijaya <[email protected]>
>
> > I didn't see a setting to override in crawl-urlfilter. How do I set
> > numberDays? I have regular expressions to include/exclude certain extensions
> > and certain urls, but that's all I have in there.
> >
> > Please send me an example and I'll give it a try.
> >
> > Thanks!
> >
> > Vijaya Peters
> >
> > -----Original Message-----
> > From: xiao yang [mailto:[email protected]]
> > Sent: Wednesday, December 09, 2009 1:41 PM
> > To: [email protected]
> > Subject: Re: how to force nutch to do a recrawl
> >
> > What about the configuration in crawl-urlfilter.txt?
> >
> > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya <[email protected]> wrote:
> > > I tried that too.
> > > In nutch-site.xml, I added the properties below, but this had no effect.
> > >
> > > <property>
> > >   <name>db.default.fetch.interval</name>
> > >   <value>0</value>
> > >   <description>(DEPRECATED) The default number of days between re-fetches
> > >   of a page. value was 30
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>db.fetch.interval.default</name>
> > >   <value>3600</value>
> > >   <description>The default number of seconds between re-fetches of a page
> > >   (30 days). value was 2592000 (30 days)
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>db.fetch.interval.max</name>
> > >   <value>3600</value>
> > >   <description>The maximum number of seconds between re-fetches of a page
> > >   (90 days). After this period every page in the db will be re-tried, no
> > >   matter what is its status. value was 7776000
> > >   </description>
> > > </property>
> > >
> > > Vijaya Peters
> > >
> > > -----Original Message-----
> > > From: MilleBii [mailto:[email protected]]
> > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > To: [email protected]
> > > Subject: Re: how to force nutch to do a recrawl
> > >
> > > Nutch only recrawls every 30 days by default.
> > > So you set numberDays adequately and it will recrawl. Read
> > > nutch-default.xml to get the details.
> > >
> > > 2009/12/9, xiao yang <[email protected]>:
> > >> What do you mean by "recrawl"?
> > >> Does the following command meet your needs?
> > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > >> Change the destination directory to one different from the last crawl's.
> > >>
> > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <[email protected]> wrote:
> > >>> I'm running Nutch 1.0 in windows. How do I force Nutch to do a complete
> > >>> recrawl?
> > >>>
> > >>> thanks,
> > >>>
> > >>> - Vijaya
> > >
> > > --
> > > -MilleBii-
>
> --
> -MilleBii-
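
The check suggested at the top of this thread (compare each entry's retry interval with the value you configured) could be scripted roughly as follows. This is only a sketch: the function name check_retry_interval is made up for illustration, and it assumes the text dump has one field per line, with each record starting on a line that begins with "http" and containing a "Retry interval: N seconds" line, as in the entry quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): list the URLs in a crawldb text
# dump whose retry interval differs from the expected number of seconds.
# Assumption: one field per line, each record starting with a line that
# begins with "http" and containing a "Retry interval: N seconds ..." line.
check_retry_interval() {
  expected="$1"   # expected interval in seconds, e.g. 3600
  dump_file="$2"  # a text file from the directory written by readdb -dump
  awk -v want="$expected" '
    /^http/ { url = $1 }                 # remember the URL of the current record
    /^Retry interval:/ {
      if ($3 != want) print url, $3      # report the URL and its actual interval
    }
  ' "$dump_file"
}
```

For example, something like `check_retry_interval 3600 whole_db/part-00000` would print every URL still carrying a different interval (the part-00000 file name is an assumption about how the dump directory is laid out; adjust it to whatever file the dump actually produced). An empty result would suggest the new interval has reached every entry.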
