Okay.  I'll dig a little deeper.  I saw a few scripts that people had
created, but I couldn't get them to work.

Thanks much.

Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA  22033
Tel:  703-502-1184

www.sra.com
Named to FORTUNE's "100 Best Companies to Work For" list for 10
consecutive years
Please consider the environment before printing this e-mail
This electronic message transmission contains information from SRA
International, Inc. which may be confidential, privileged or
proprietary.  The information is intended for the use of the individual
or entity named above.  If you are not the intended recipient, be aware
that any disclosure, copying, distribution, or use of the contents of
this information is strictly prohibited.  If you have received this
electronic information in error, please notify us immediately by
telephone at 866-584-2143.

-----Original Message-----
From: MilleBii [mailto:mille...@gmail.com] 
Sent: Wednesday, December 09, 2009 4:05 PM
To: nutch-user@lucene.apache.org
Subject: Re: how to force nutch to do a recrawl

I don't think you can use the nutch crawl command to do that; it is a
one-stop-shop command.
You probably want to use the individual commands instead.
Type nutch generate to get the help and you will see the -adddays option.
Read this page on the wiki to get a feel for how to proceed:
http://wiki.apache.org/nutch/Crawl
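
A minimal sketch of what one recrawl round with the individual commands might look like. It is a dry run that only prints the commands. The crawl directory name (crawl), the value 31 for -adddays (chosen to exceed the default 30-day interval mentioned in this thread), and the SEGMENT placeholder are assumptions for illustration; check the usage output of each command in your own Nutch 1.0 install before running anything.

```shell
#!/bin/sh
# Dry-run sketch of one recrawl round using the individual Nutch commands.
# It only prints the commands; nothing is run against a real crawl.
# Assumed values (not from the thread): edit these to match your setup.
NUTCH="bin/nutch"
CRAWL_DIR="crawl"
ADDDAYS="31"   # assumed: just past the default 30-day fetch interval

recrawl_plan() {
  # 1. Select URLs as if the clock were ADDDAYS days ahead.
  echo "$NUTCH generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments -adddays $ADDDAYS"
  # 2. Fetch the segment that generate created (SEGMENT is a placeholder
  #    for the new timestamped directory under segments/).
  echo "$NUTCH fetch $CRAWL_DIR/segments/SEGMENT"
  # 3. Fold the fetch results back into the crawldb.
  echo "$NUTCH updatedb $CRAWL_DIR/crawldb $CRAWL_DIR/segments/SEGMENT"
}

recrawl_plan
```

Printing the plan first makes it easy to compare the arguments against the help text before touching the crawldb. Re-indexing after the update is a separate step and is not covered here.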

2009/12/9 Peters, Vijaya <vijaya_pet...@sra.com>

> I didn't see a setting to override in crawl-urlfilter.  How do I set
> numberDays? I have regular expressions to include/exclude certain
> extensions and certain urls, but that's all I have in there.
>
> Please send me an example and I'll give it a try.
>
> Thanks!
>
> Vijaya Peters
>
> -----Original Message-----
> From: xiao yang [mailto:yangxiao9...@gmail.com]
> Sent: Wednesday, December 09, 2009 1:41 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: how to force nutch to do a recrawl
>
> What about the configuration in crawl-urlfilter.txt?
>
> On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya <vijaya_pet...@sra.com>
> wrote:
> > I tried that too.
> > In nutch-site.xml, I added the properties below, but this had no effect.
> >
> > <property>
> >  <name>db.default.fetch.interval</name>
> >  <value>0</value>
> >  <description>(DEPRECATED) The default number of days between re-fetches
> >  of a page.  value was 30
> >  </description>
> > </property>
> >
> > <property>
> >  <name>db.fetch.interval.default</name>
> >  <value>3600</value>
> >  <description>The default number of seconds between re-fetches of a page
> >  (30 days). value was 2592000 (30 days)
> >  </description>
> > </property>
> >
> > <property>
> >  <name>db.fetch.interval.max</name>
> >  <value>3600</value>
> >  <description>The maximum number of seconds between re-fetches of a page
> >  (90 days). After this period every page in the db will be re-tried, no
> >  matter what is its status.  value was 7776000
> >  </description>
> > </property>
> >
> > Vijaya Peters
> >
> > -----Original Message-----
> > From: MilleBii [mailto:mille...@gmail.com]
> > Sent: Wednesday, December 09, 2009 1:27 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: how to force nutch to do a recrawl
> >
> > Nutch only recrawls every 30 days by default. So set numberDays
> > appropriately and it will recrawl. Read nutch-default.xml to get the
> > details.
> >
> > 2009/12/9, xiao yang <yangxiao9...@gmail.com>:
> >> What do you mean by "recrawl"?
> >> Does the following command do what you need?
> >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> >> Change the destination directory to one different from the last crawl's.
> >>
> >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <vijaya_pet...@sra.com>
> >> wrote:
> >>> I'm running Nutch 1.0 on Windows.  How do I force Nutch to do a
> >>> complete recrawl?
> >>>
> >>>
> >>>
> >>> thanks,
> >>>
> >>> - Vijaya
> >>
> >
> >
> > --
> > -MilleBii-
> >
>



-- 
-MilleBii-
