Hi,
Check the fetch time in your crawldb. You can dump the whole crawldb like this:

./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db

entries will look like this:

http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 0.0014977538
Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
Metadata: _pst_: success(1), lastModified=0
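
To pull only the scheduling fields out of a dump, something like the sketch below may help. It works on a sample record in the format shown above. Against a real dump you would feed it the dump file instead; whole_db/part-00000 is an assumed file name for local mode, not something confirmed here. The function name summarize_dump is made up for illustration.

```shell
# Sample record in the dump format shown above (abbreviated).
dump_sample='http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Retry interval: 18000 seconds (0 days)'

# Print "url<TAB>fetch time<TAB>retry interval" for each record on stdin.
# summarize_dump is a hypothetical helper, not part of Nutch.
summarize_dump() {
  awk '
    /^http/             { url = $1 }
    /^Fetch time: /     { sub(/^Fetch time: /, ""); fetch = $0 }
    /^Retry interval: / { print url "\t" fetch "\t" $3 " " $4 }
  '
}

summary=$(printf '%s\n' "$dump_sample" | summarize_dump)
printf '%s\n' "$summary"

# With a real dump (file name is an assumption):
#   summarize_dump < whole_db/part-00000
```

A retry interval that still shows the old default would suggest that the entry was written before the new setting took effect.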


As you can see, the 'Fetch time' field gives the next time the page is due to be fetched:
'Fetch time: Thu Dec 10 09:19:18 EST 2009'.
Also check the retry interval: with your settings it should be 3600 seconds.
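
If the fetch time is still in the future, the -adddays option of nutch generate (mentioned further down the thread) can be used to make the page eligible earlier. The sketch below is a simplified model of that check, not Nutch's actual code: it assumes a page is eligible once its fetch time is no later than now plus adddays days. The helper names, the epoch values, and the segments path in the final comment are made up for illustration.

```shell
# Simplified model (an assumption): a page is eligible for "generate"
# once fetch time <= now + adddays * one day.
DAY=86400

# is_due FETCH_EPOCH NOW_EPOCH ADDDAYS -> exit status 0 if eligible
is_due() {
  [ "$1" -le $(( $2 + $3 * DAY )) ]
}

# adddays_needed FETCH_EPOCH NOW_EPOCH -> smallest whole -adddays value
adddays_needed() {
  if [ "$1" -le "$2" ]; then
    echo 0
  else
    echo $(( ($1 - $2 + DAY - 1) / DAY ))
  fi
}

now=1260392400            # hypothetical "now", in epoch seconds
fetch=$(( now + 3600 ))   # fetch time one hour ahead (3600 s interval)
days=$(adddays_needed "$fetch" "$now")
echo "adddays needed: $days"

# The value could then be passed on (segments path is an assumption):
#   ./bin/nutch generate $your_crawl_rep/crawldb $your_crawl_rep/segments -adddays $days
```

The generate step only selects pages; they are actually refetched by the fetch and updatedb steps that follow it.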

Hope this helps.


> Subject: RE: how to force nutch to do a recrawl
> Date: Wed, 9 Dec 2009 16:06:58 -0500
> From: [email protected]
> To: [email protected]
> 
> Okay.  I'll dig a little deeper.  I saw a few scripts that people had
> created, but I couldn't get them to work.
> 
> Thanks much.
> 
> Vijaya Peters
> SRA International, Inc.
> 4350 Fair Lakes Court North
> Room 4004
> Fairfax, VA  22033
> Tel:  703-502-1184
> 
> www.sra.com
> Named to FORTUNE's "100 Best Companies to Work For" list for 10
> consecutive years
> P Please consider the environment before printing this e-mail
> This electronic message transmission contains information from SRA
> International, Inc. which may be confidential, privileged or
> proprietary.  The information is intended for the use of the individual
> or entity named above.  If you are not the intended recipient, be aware
> that any disclosure, copying, distribution, or use of the contents of
> this information is strictly prohibited.  If you have received this
> electronic information in error, please notify us immediately by
> telephone at 866-584-2143.
> 
> -----Original Message-----
> From: MilleBii [mailto:[email protected]] 
> Sent: Wednesday, December 09, 2009 4:05 PM
> To: [email protected]
> Subject: Re: how to force nutch to do a recrawl
> 
> I don't think you can use the nutch crawl command to do that; it is a
> one-stop-shop command.
> You probably want to use the individual commands.
> Type nutch generate to get the help and you will see the -adddays option.
> Read this page on the wiki to get a feel for how to proceed:
> http://wiki.apache.org/nutch/Crawl
> 
> 2009/12/9 Peters, Vijaya <[email protected]>
> 
> > I didn't see a setting to override in crawl-urlfilter.  How do I set
> > numberDays? I have regular expressions to include/exclude certain
> > extensions and certain urls, but that's all I have in there.
> >
> > Please send me an example and I'll give it a try.
> >
> > Thanks!
> >
> > Vijaya Peters
> >
> > -----Original Message-----
> > From: xiao yang [mailto:[email protected]]
> > Sent: Wednesday, December 09, 2009 1:41 PM
> > To: [email protected]
> > Subject: Re: how to force nutch to do a recrawl
> >
> > What about the configuration in crawl-urlfilter.txt?
> >
> > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya <[email protected]> wrote:
> > > I tried that too.
> > > In nutch-site.xml, I added the properties below, but this had no effect.
> > >
> > > <property>
> > >  <name>db.default.fetch.interval</name>
> > >  <value>0</value>
> > >  <description>(DEPRECATED) The default number of days between re-fetches
> > >  of a page.  value was 30
> > >  </description>
> > > </property>
> > >
> > > <property>
> > >  <name>db.fetch.interval.default</name>
> > >  <value>3600</value>
> > >  <description>The default number of seconds between re-fetches of a page
> > >  (30 days). value was 2592000 (30 days)
> > >  </description>
> > > </property>
> > >
> > > <property>
> > >  <name>db.fetch.interval.max</name>
> > >  <value>3600</value>
> > >  <description>The maximum number of seconds between re-fetches of a page
> > >  (90 days). After this period every page in the db will be re-tried, no
> > >  matter what is its status.  value was 7776000
> > >  </description>
> > > </property>
> > >
> > > Vijaya Peters
> > >
> > > -----Original Message-----
> > > From: MilleBii [mailto:[email protected]]
> > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > To: [email protected]
> > > Subject: Re: how to force nutch to do a recrawl
> > >
> > > Nutch only recrawls every 30 days by default. So set numberDays
> > > accordingly and it will recrawl. Read nutch-default.xml to get the
> > > details.
> > >
> > > 2009/12/9, xiao yang <[email protected]>:
> > >> What do you mean by "recrawl"?
> > >> Does the following command meet your needs?
> > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > >> Change the destination directory to a different one from the last crawl.
> > >>
> > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <[email protected]> wrote:
> > >>> I'm running Nutch 1.0 on Windows.  How do I force Nutch to do a
> > >>> complete recrawl?
> > >>>
> > >>>
> > >>>
> > >>> thanks,
> > >>>
> > >>> - Vijaya
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >
> > >
> > > --
> > > -MilleBii-
> > >
> >
> 
> 
> 
> -- 
> -MilleBii-
                                          