For some reason I thought my crawl was complete, but it's not. I opened the log and noticed a lot of MapQuest and Google Maps pages being crawled. Why are these locations being crawled when the only URL in my filter list is irs.gov?
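A common cause of this (in Nutch 0.7-era setups, if that is what is being run here) is a catch-all accept rule at the bottom of the URL filter file. Rules are evaluated top to bottom and the first match wins, so an irs.gov rule followed by a trailing `+.` still accepts every outlinked domain. A minimal sketch of a restrictive filter file, assuming the goal is to stay on irs.gov only:

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# accept only hosts under irs.gov; first matching rule wins
+^http://([a-z0-9-]*\.)*irs\.gov/

# reject everything else (no trailing "+." catch-all,
# which would accept MapQuest, Google Maps, etc.)
-.
```

Checking the end of the active filter file (crawl-urlfilter.txt or regex-urlfilter.txt, depending on how the crawl was started) for a `+.` line would confirm whether this is the cause.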
-----Original Message-----
From: Jeff Ritchie [mailto:[EMAIL PROTECTED]
Sent: Monday, February 27, 2006 10:17 AM
To: [email protected]
Subject: Re: injecting new urls

Hi Richard,

Under the default settings, none of your IRS links should be fetched again for 30 days.

Jeff.

Richard Braman wrote:
>I have read the recent posts on this subject but want to confirm my
>understanding of what I am doing. I just completed my initial crawl
>using www.irs.gov. I now want to add some more tax-related URLs and
>fetch those, but without doing irs.gov again. I created a newurls file
>and also put regexps in the URL filter.
>
>Are these commands what I should do?
>
>bin/nutch inject db newurls
>bin/nutch generate db segments
>bin/nutch fetch segments/<latest_segment>
>
>Should I remove irs.gov from the filter so it doesn't get fetched
>again, because some of the other new urls link back to irs.gov?
>
>Richard Braman
>mailto:[EMAIL PROTECTED]
>561.748.4002 (voice)
>
>http://www.taxcodesoftware.org <http://www.taxcodesoftware.org/>
>Free Open Source Tax Software
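The quoted steps can be sketched as a short command sequence. This assumes the Nutch 0.7-style `db`/`segments` layout from the quoted mail; the `latest` variable is just one way to resolve the `<latest_segment>` placeholder, relying on segments being named by timestamp so the newest sorts last:

```
# Inject the new tax-related URLs into the existing web db,
# then generate a fetch list as a new segment.
bin/nutch inject db newurls
bin/nutch generate db segments

# Segment directories are timestamped, so the newest sorts last.
latest=$(ls -d segments/* | tail -1)

# Fetch the segment, then update the db so newly discovered
# links are recorded for future generate rounds.
bin/nutch fetch "$latest"
bin/nutch updatedb db "$latest"
```

Note the `updatedb` step after the fetch: without it, the web db never learns about the pages just fetched, and the next `generate` can re-emit the same URLs.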
