Yes, just add those properties to nutch-site.xml and it should work. But are 
you really going to recrawl every hour? I see 3600 seconds!

Another thing: you have to make an initial clean crawl with the new 
fetch interval, because the fetch times already stored in the crawldb will 
not change automatically. (In my case they didn't change; I just deleted the 
crawldb and made a clean crawl, and it works.)
Maybe someone can tell you how to change the fetch time in the crawldb without 
deleting it for an initial clean crawl.
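
A rough sketch of that clean crawl as shell functions, assuming the directory
names used elsewhere in this thread (crawl, urls, whole_db). The function names
clean_recrawl and count_retry_intervals are made up for illustration, and the
old crawl directory is moved aside rather than deleted so it can be restored.

```shell
#!/bin/sh
# Sketch only: the paths (crawl, urls) come from this thread; adjust to your setup.

# Move the old crawl directory aside and crawl again from the seed list,
# so that every crawldb entry is created with the new interval settings.
clean_recrawl() {
    mv crawl "crawl.bak.$(date +%Y%m%d%H%M%S)" || return 1
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50
}

# Print "<seconds> <count>" for each retry interval found in a readdb dump.
# $1 is the dump file, e.g. whole_db/part-00000.
count_retry_intervals() {
    awk '/^Retry interval:/ { n[$3]++ } END { for (k in n) print k, n[k] }' "$1" | sort -n
}
```

After clean_recrawl, dump the crawldb again with
bin/nutch readdb crawl/crawldb/ -dump whole_db and run
count_retry_intervals whole_db/part-00000. If the new setting took effect,
3600 should be the only interval listed.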

thx


> Subject: RE: how to force nutch to do a recrawl
> Date: Mon, 14 Dec 2009 11:26:31 -0500
> From: [email protected]
> To: [email protected]
> 
> Adam,
> I finally got the command to work on another server (see below).  To
> change the retry interval, should I just add the two properties to
> nutch-site.xml (though I tried this before and it didn't work):
> 
> http://mysite/        Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri Jan 08 15:42:33 EST 2010  
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)  
> Score: 1.0
> Signature: e04ab1ac06075fc273dbe1334a6c6dc5
> Metadata: _pst_: success(1), lastModified=0
> 
> 
> <property>
> <name>db.fetch.interval.default</name>
> <value>3600</value>
> <description>The default number of seconds between re-fetches of 
> a page (30 days). 
> </description>
> </property>
> 
> <property>
> <name>db.fetch.interval.max</name>
> <value>3600</value>
> <description>The maximum number of seconds between re-fetches of 
> a page (90 days). After this period every page in the db will be 
> re-tried, no matter what is its status.  </description> 
> </property>
> 
> 
> Vijaya Peters
> SRA International, Inc.
> 4350 Fair Lakes Court North
> Room 4004
> Fairfax, VA  22033
> Tel:  703-502-1184
> 
> www.sra.com
> Named to FORTUNE's "100 Best Companies to Work For" list for 10
> consecutive years
> P Please consider the environment before printing this e-mail
> This electronic message transmission contains information from SRA
> International, Inc. which may be confidential, privileged or
> proprietary.  The information is intended for the use of the individual
> or entity named above.  If you are not the intended recipient, be aware
> that any disclosure, copying, distribution, or use of the contents of
> this information is strictly prohibited.  If you have received this
> electronic information in error, please notify us immediately by
> telephone at 866-584-2143.
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:[email protected]] 
> Sent: Friday, December 11, 2009 3:11 PM
> To: [email protected]
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> hi,
> 
> you shouldn't open the .crc file; you have to open the other one, which is
> part-00000.
> use vi to edit part-00000.
> if you don't find this file, your dump failed... just check the
> logs/hadoop.log file
> 
> 
> 
> 
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Fri, 11 Dec 2009 09:14:26 -0500
> > From: [email protected]
> > To: [email protected]
> > 
> > Adam,
> > I'm using cygwin to run the scripts.  I use EditPlus to edit the
> > files.  But EditPlus won't allow me to edit the crc file.  I'll see if I
> > can ftp the file to a unix machine.
> > 
> > 
> > Vijaya Peters
> > 
> > 
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:[email protected]]
> > Sent: Thu 12/10/2009 6:43 PM
> > To: [email protected]
> > Subject: RE: how to force nutch to do a recrawl
> >  
> > 
> > 
> > but how are you running sh scripts...
> > you have to use cygwin to be able to edit linux files
> > 
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Thu, 10 Dec 2009 16:09:13 -0500
> > > From: [email protected]
> > > To: [email protected]
> > > 
> > > Adam,
> > > I'm on windows unfortunately!!  I'm using cygdrive, but it doesn't
> > > recognize vi.  Any idea for opening it in windows?  Notepad didn't
> > > work either.
> > > 
> > > Vijaya Peters
> > > 
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:[email protected]] 
> > > Sent: Thursday, December 10, 2009 4:01 PM
> > > To: [email protected]
> > > Subject: RE: how to force nutch to do a recrawl
> > > 
> > > 
> > > just use vi or vim
> > > 
> > > 
> > > i use vi to edit the file
> > > 
> > > 
> > > 
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > > > From: [email protected]
> > > > To: [email protected]
> > > > 
> > > > Adam,
> > > > What do I use to open a CRC file? I tried QuickSFV.  Thanks in
> > > > advance!
> > > > 
> > > > Vijaya Peters
> > > > 
> > > > -----Original Message-----
> > > > From: BELLINI ADAM [mailto:[email protected]] 
> > > > Sent: Thursday, December 10, 2009 3:48 PM
> > > > To: [email protected]
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > 
> > > > 
> > > > it will not dump to the console!
> > > > whole_db is a folder, and you have to edit the file you will find in
> > > > this folder
> > > > 
> > > > 
> > > > 
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > > > From: [email protected]
> > > > > To: [email protected]
> > > > > 
> > > > > Adam,
> > > > > I tried running that command and get the following (it created a
> > > > > whole_db directory, but it's not dumping out the contents to the
> > > > > console):
> > > > > 
> > > > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > > > CrawlDb dump: starting
> > > > > CrawlDb db: crawl/crawldb/
> > > > > CrawlDb dump: done
> > > > > 
> > > > > Vijaya Peters
> > > > > -----Original Message-----
> > > > > From: BELLINI ADAM [mailto:[email protected]] 
> > > > > Sent: Thursday, December 10, 2009 1:40 PM
> > > > > To: [email protected]
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > 
> > > > > 
> > > > > hi,
> > > > > check the fetch time in your crawldb... you can dump the whole crawldb
> > > > > like this:
> > > > > 
> > > > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > > > 
> > > > > entries will look like this:
> > > > > 
> > > > > http://www.YOUR_URL_TO_FETCH
> > > > > Status: 2 (db_fetched)
> > > > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > > Retries since fetch: 0
> > > > > Retry interval: 18000 seconds (0 days)
> > > > > Score: 0.0014977538
> > > > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > > > Metadata: _pst_: success(1), lastModified=0
> > > > > 
> > > > > 
> > > > > as you can see, the next time the page will be fetched is given by
> > > > > the fetch time:
> > > > > 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > > > > and check the retry interval: it should be your 3600.
> > > > > 
> > > > > hope it will help
> > > > > 
> > > > > 
> > > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > > > From: [email protected]
> > > > > > To: [email protected]
> > > > > > 
> > > > > > Okay.  I'll dig a little deeper.  I saw a few scripts that people
> > > > > > had created, but I couldn't get them to work.
> > > > > > 
> > > > > > Thanks much.
> > > > > > 
> > > > > > Vijaya Peters
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: MilleBii [mailto:[email protected]] 
> > > > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > > > To: [email protected]
> > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > 
> > > > > > I don't think you can use the nutch crawl command to do that; it is a
> > > > > > one-stop-shop command.
> > > > > > You probably want to use the individual commands.
> > > > > > Type nutch generate to get the help and you will see the option
> > > > > > -adddays;
> > > > > > read this page on the wiki to get a feel for how to do it:
> > > > > > http://wiki.apache.org/nutch/Crawl
> > > > > > 
> > > > > > 2009/12/9 Peters, Vijaya <[email protected]>
> > > > > > 
> > > > > > > I didn't see a setting to override in crawl-urlfilter.  How do I set
> > > > > > > numberDays? I have regular expressions to include/exclude certain
> > > > > > > extensions and certain urls, but that's all I have in there.
> > > > > > >
> > > > > > > Please send me an example and I'll give it a try.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > Vijaya Peters
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: xiao yang [mailto:[email protected]]
> > > > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > >
> > > > > > > What about the configuration in crawl-urlfilter.txt?
> > > > > > >
> > > > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya <[email protected]> wrote:
> > > > > > > > I tried that too.
> > > > > > > > In nutch-site.xml, I added the properties below, but this had no
> > > > > > > > effect.
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.default.fetch.interval</name>
> > > > > > > >  <value>0</value>
> > > > > > > >  <description>(DEPRECATED) The default number of days between
> > > > > > > >  re-fetches of a page.  value was 30
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.fetch.interval.default</name>
> > > > > > > >  <value>3600</value>
> > > > > > > >  <description>The default number of seconds between re-fetches of a
> > > > > > > >  page (30 days). value was 2592000 (30 days)
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.fetch.interval.max</name>
> > > > > > > >  <value>3600</value>
> > > > > > > >  <description>The maximum number of seconds between re-fetches of a
> > > > > > > >  page (90 days). After this period every page in the db will be
> > > > > > > >  re-tried, no matter what is its status.  value was 7776000
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > Vijaya Peters
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: MilleBii [mailto:[email protected]]
> > > > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > > > To: [email protected]
> > > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > > >
> > > > > > > > > Nutch only recrawls every 30 days by default. So set numberDays
> > > > > > > > > appropriately and it will recrawl; read nutch-default.xml for the
> > > > > > > > > details.
> > > > > > > >
> > > > > > > > 2009/12/9, xiao yang <[email protected]>:
> > > > > > > >> What do you mean by "recrawl"?
> > > > > > > > >> Does the following command meet your needs?
> > > > > > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > > > > >> Change the destination directory to one different from the last
> > > > > > > > >> crawl's.
> > > > > > > >>
> > > > > > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <[email protected]> wrote:
> > > > > > > > >>> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a
> > > > > > > > >>> complete recrawl?
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> thanks,
> > > > > > > >>>
> > > > > > > >>> - Vijaya
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Vijaya Peters
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -MilleBii-
> > > > > > > >
> > > > > > >
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > -MilleBii-
