It will not dump to the console!
whole_db is a folder, and you have to open the file you will find in that folder.
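
For example, here is a rough sketch for pulling the fetch times out of the dump. It assumes the dump folder holds plain-text part files (something like whole_db/part-00000, which is what a local run usually writes) and that the records look like the entry quoted below. show_fetch_times is only an illustrative helper name, not a Nutch command.

```shell
#!/bin/sh
# Print each URL in a CrawlDb text dump together with its "Fetch time" line.
# Assumption: the dump directory contains plain-text part files (part-*),
# with one record per URL in the format shown in the quoted message.
show_fetch_times() {
    dump_dir="$1"
    # Keep only the record header lines (they start with the URL)
    # and the "Fetch time:" lines that follow them.
    cat "$dump_dir"/part-* | grep -E '^(https?://|Fetch time:)'
}

# Usage, after running: bin/nutch readdb crawl/crawldb/ -dump whole_db
#   ls whole_db
#   show_fetch_times whole_db
```

If the folder turns out to contain differently named files, list it with ls first and adjust the part-* pattern accordingly.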



> Subject: RE: how to force nutch to do a recrawl
> Date: Thu, 10 Dec 2009 14:26:30 -0500
> From: [email protected]
> To: [email protected]
> 
> Adam,
> I tried running that command and get the following (it created a
> whole_db directory, but it's not dumping out the contents to the
> console):
> 
> $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> CrawlDb dump: starting
> CrawlDb db: crawl/crawldb/
> CrawlDb dump: done
> 
> Vijaya Peters
> SRA International, Inc.
> 4350 Fair Lakes Court North
> Room 4004
> Fairfax, VA  22033
> Tel:  703-502-1184
> 
> www.sra.com
> Named to FORTUNE's "100 Best Companies to Work For" list for 10
> consecutive years
> Please consider the environment before printing this e-mail
> This electronic message transmission contains information from SRA
> International, Inc. which may be confidential, privileged or
> proprietary.  The information is intended for the use of the individual
> or entity named above.  If you are not the intended recipient, be aware
> that any disclosure, copying, distribution, or use of the contents of
> this information is strictly prohibited.  If you have received this
> electronic information in error, please notify us immediately by
> telephone at 866-584-2143.
> -----Original Message-----
> From: BELLINI ADAM [mailto:[email protected]] 
> Sent: Thursday, December 10, 2009 1:40 PM
> To: [email protected]
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> hi,
> check the fetch time in your crawldb. You can dump the whole crawldb like
> this:
> 
> ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> 
> entries will look like this:
> 
> http://www.YOUR_URL_TO_FETCH
> Status: 2 (db_fetched)
> Fetch time: Thu Dec 10 09:19:18 EST 2009
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 18000 seconds (0 days)
> Score: 0.0014977538
> Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> Metadata: _pst_: success(1), lastModified=0
> 
> 
> As you can see, the next time the page will be fetched is given by the
> fetch time:
> 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> Also check the retry interval: it should be your 3600.
> 
> hope it will help
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > From: [email protected]
> > To: [email protected]
> > 
> > Okay.  I'll dig a little deeper.  I saw a few scripts that people had
> > created, but I couldn't get them to work.
> > 
> > Thanks much.
> > 
> > Vijaya Peters
> > 
> > -----Original Message-----
> > From: MilleBii [mailto:[email protected]] 
> > Sent: Wednesday, December 09, 2009 4:05 PM
> > To: [email protected]
> > Subject: Re: how to force nutch to do a recrawl
> > 
> > I don't think you can use the nutch crawl command to do that; it is a
> > one-stop-shop command.
> > You probably want to use the individual commands.
> > Type nutch generate to get the help and you will see the -adddays option.
> > Read this page on the wiki to get a feel for how to proceed:
> > http://wiki.apache.org/nutch/Crawl
> > 
> > 2009/12/9 Peters, Vijaya <[email protected]>
> > 
> > > I didn't see a setting to override in crawl-urlfilter.  How do I set
> > > numberDays? I have regular expressions to include/exclude certain
> > > extensions and certain urls, but that's all I have in there.
> > >
> > > Please send me an example and I'll give it a try.
> > >
> > > Thanks!
> > >
> > > Vijaya Peters
> > >
> > > -----Original Message-----
> > > From: xiao yang [mailto:[email protected]]
> > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > To: [email protected]
> > > Subject: Re: how to force nutch to do a recrawl
> > >
> > > What about the configuration in crawl-urlfilter.txt?
> > >
> > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya <[email protected]> wrote:
> > > > I tried that too.
> > > > In nutch-site.xml, I added the properties below, but this had no effect.
> > > >
> > > > <property>
> > > >  <name>db.default.fetch.interval</name>
> > > >  <value>0</value>
> > > >  <description>(DEPRECATED) The default number of days between re-fetches
> > > >  of a page.  value was 30
> > > >  </description>
> > > > </property>
> > > >
> > > > <property>
> > > >  <name>db.fetch.interval.default</name>
> > > >  <value>3600</value>
> > > >  <description>The default number of seconds between re-fetches of a page
> > > >  (30 days). value was 2592000 (30 days)
> > > >  </description>
> > > > </property>
> > > >
> > > > <property>
> > > >  <name>db.fetch.interval.max</name>
> > > >  <value>3600</value>
> > > >  <description>The maximum number of seconds between re-fetches of a page
> > > >  (90 days). After this period every page in the db will be re-tried, no
> > > >  matter what is its status.  value was 7776000
> > > >  </description>
> > > > </property>
> > > >
> > > > Vijaya Peters
> > > >
> > > > -----Original Message-----
> > > > From: MilleBii [mailto:[email protected]]
> > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > To: [email protected]
> > > > Subject: Re: how to force nutch to do a recrawl
> > > >
> > > > Nutch only recrawls every 30 days by default. So set numberDays
> > > > appropriately and it will recrawl. Read nutch-default.xml to get the
> > > > details.
> > > >
> > > > 2009/12/9, xiao yang <[email protected]>:
> > > >> What do you mean by "recrawl"?
> > > >> Does the following command meet your needs?
> > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > >> Change the destination directory to one different from the last crawl.
> > > >>
> > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <[email protected]> wrote:
> > > >>> I'm running Nutch 1.0 on Windows.  How do I force Nutch to do a complete
> > > >>> recrawl?
> > > >>>
> > > >>>
> > > >>>
> > > >>> thanks,
> > > >>>
> > > >>> - Vijaya
> > > >>>
> > > >>>
> > > >>>
> > > >>> Vijaya Peters
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > > >
> > > > --
> > > > -MilleBii-
> > > >
> > >
> > 
> > 
> > 
> > -- 
> > -MilleBii-
>                                         