Just use vi or vim.
I use vi to edit the file.

> Subject: RE: how to force nutch to do a recrawl
> Date: Thu, 10 Dec 2009 15:58:24 -0500
> From: [email protected]
> To: [email protected]
>
> Adam,
> What do I use to open a CRC file? I tried QuickSFV. Thanks in advance!
>
> Vijaya Peters
> SRA International, Inc.
> 4350 Fair Lakes Court North
> Room 4004
> Fairfax, VA 22033
> Tel: 703-502-1184
>
> www.sra.com
> Named to FORTUNE's "100 Best Companies to Work For" list for 10
> consecutive years
> Please consider the environment before printing this e-mail
> This electronic message transmission contains information from SRA
> International, Inc. which may be confidential, privileged or
> proprietary. The information is intended for the use of the individual
> or entity named above. If you are not the intended recipient, be aware
> that any disclosure, copying, distribution, or use of the contents of
> this information is strictly prohibited. If you have received this
> electronic information in error, please notify us immediately by
> telephone at 866-584-2143.
>
> -----Original Message-----
> From: BELLINI ADAM [mailto:[email protected]]
> Sent: Thursday, December 10, 2009 3:48 PM
> To: [email protected]
> Subject: RE: how to force nutch to do a recrawl
>
> It will not dump to the console!
> whole_db is a folder, and you have to edit the file you will find in
> this folder.
>
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > From: [email protected]
> > To: [email protected]
> >
> > Adam,
> > I tried running that command and get the following (it created a
> > whole_db directory, but it's not dumping out the contents to the
> > console):
> >
> > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > CrawlDb dump: starting
> > CrawlDb db: crawl/crawldb/
> > CrawlDb dump: done
> >
> > Vijaya Peters
> >
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:[email protected]]
> > Sent: Thursday, December 10, 2009 1:40 PM
> > To: [email protected]
> > Subject: RE: how to force nutch to do a recrawl
> >
> > Hi,
> > Check the fetch time in your crawldb. You can dump the whole crawldb
> > like this:
> >
> > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> >
> > Entries will look like this:
> >
> > http://www.YOUR_URL_TO_FETCH
> > Status: 2 (db_fetched)
> > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > Modified time: Wed Dec 31 19:00:00 EST 1969
> > Retries since fetch: 0
> > Retry interval: 18000 seconds (0 days)
> > Score: 0.0014977538
> > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > Metadata: _pst_: success(1), lastModified=0
> >
> > As you can see, the next time the page will be fetched is given by
> > the fetch time:
> > 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > Also check the retry interval: it should be your 3600.
> >
> > Hope it helps.
> >
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > From: [email protected]
> > > To: [email protected]
> > >
> > > Okay. I'll dig a little deeper. I saw a few scripts that people
> > > had created, but I couldn't get them to work.
> > >
> > > Thanks much.
> > >
> > > Vijaya Peters
> > >
> > > -----Original Message-----
> > > From: MilleBii [mailto:[email protected]]
> > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > To: [email protected]
> > > Subject: Re: how to force nutch to do a recrawl
> > >
> > > I don't think you can use the nutch crawl command to do that; it
> > > is a one-stop-shop command.
> > > You probably want to use the individual commands.
> > > Type nutch generate to get the help and you will see the option
> > > -adddays. Read this page on the wiki to get a feel for how you
> > > should do it:
> > > http://wiki.apache.org/nutch/Crawl
> > >
> > > 2009/12/9 Peters, Vijaya <[email protected]>
> > >
> > > > I didn't see a setting to override in crawl-urlfilter. How do I
> > > > set numberDays? I have regular expressions to include/exclude
> > > > certain extensions and certain urls, but that's all I have in
> > > > there.
> > > >
> > > > Please send me an example and I'll give it a try.
> > > >
> > > > Thanks!
> > > >
> > > > Vijaya Peters
> > > >
> > > > -----Original Message-----
> > > > From: xiao yang [mailto:[email protected]]
> > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > To: [email protected]
> > > > Subject: Re: how to force nutch to do a recrawl
> > > >
> > > > What about the configuration in crawl-urlfilter.txt?
> > > >
> > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> > > > <[email protected]> wrote:
> > > > > I tried that too.
> > > > > In nutch-site.xml, I added the below, but this had no effect.
> > > > >
> > > > > <property>
> > > > > <name>db.default.fetch.interval</name>
> > > > > <value>0</value>
> > > > > <description>(DEPRECATED) The default number of days between
> > > > > re-fetches of a page. value was 30
> > > > > </description>
> > > > > </property>
> > > > >
> > > > > <property>
> > > > > <name>db.fetch.interval.default</name>
> > > > > <value>3600</value>
> > > > > <description>The default number of seconds between re-fetches
> > > > > of a page (30 days). value was 2592000 (30 days)
> > > > > </description>
> > > > > </property>
> > > > >
> > > > > <property>
> > > > > <name>db.fetch.interval.max</name>
> > > > > <value>3600</value>
> > > > > <description>The maximum number of seconds between re-fetches
> > > > > of a page (90 days). After this period every page in the db
> > > > > will be re-tried, no matter what is its status. value was
> > > > > 7776000
> > > > > </description>
> > > > > </property>
> > > > >
> > > > > Vijaya Peters
> > > > >
> > > > > -----Original Message-----
> > > > > From: MilleBii [mailto:[email protected]]
> > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > To: [email protected]
> > > > > Subject: Re: how to force nutch to do a recrawl
> > > > >
> > > > > Nutch only recrawls every 30 days by default. So set
> > > > > numberDays adequately and it will recrawl. Read
> > > > > nutch-default.xml to get the details.
> > > > >
> > > > > 2009/12/9, xiao yang <[email protected]>:
> > > > >> What do you mean by "recrawl"?
> > > > >> Does the following command meet what you need?
> > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > >> Change the destination directory to a different one from the
> > > > >> last crawl.
> > > > >>
> > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya
> > > > >> <[email protected]> wrote:
> > > > >>> I'm running Nutch 1.0 in windows. How do I force Nutch to
> > > > >>> do a complete recrawl?
> > > > >>>
> > > > >>> thanks,
> > > > >>>
> > > > >>> - Vijaya
> > > > >>>
> > > > >>> Vijaya Peters
> > > > >>
> > > > >
> > > > > --
> > > > > -MilleBii-
> > >
> > > --
> > > -MilleBii-
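
PS: if you would rather not scroll through the dump in an editor, here is a minimal Python sketch that pulls the fetch time and retry interval out of each entry. It assumes the entry layout quoted above (a URL line followed by "Key: value" lines, with entries separated by blank lines). The function names are made up for illustration and are not part of Nutch. The thread does not give the name of the file inside whole_db, so the sketch parses a pasted sample instead of opening a file.

```python
import re

# Sample entry copied from the thread; a real dump may differ in detail.
SAMPLE = """http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 0.0014977538
Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
Metadata: _pst_: success(1), lastModified=0
"""


def parse_crawldb_dump(text):
    """Return {url: {field: value}} for text in the layout shown above.

    Hypothetical helper: assumes blank lines separate entries and that the
    first token of each entry's first line is the URL.
    """
    records = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        url = lines[0].split()[0]
        fields = {}
        for ln in lines[1:]:
            # Split on the first ": " only, so values may contain colons.
            key, sep, value = ln.partition(": ")
            if sep:
                fields[key] = value
        records[url] = fields
    return records


def retry_interval_seconds(fields):
    """Extract the leading integer from e.g. '18000 seconds (0 days)'."""
    return int(fields["Retry interval"].split()[0])


if __name__ == "__main__":
    for url, fields in parse_crawldb_dump(SAMPLE).items():
        print(url, "|", fields["Fetch time"], "|", retry_interval_seconds(fields))
```

With the sample entry the retry interval comes out as 18000 seconds, so a crawldb that still shows a value like that instead of 3600 would suggest the new setting has not been applied to the existing entries.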
