For some reason I thought my crawl was complete, but it's not.  I opened
the log and noticed that a lot of MapQuest and Google Maps pages are
being crawled.  Why are these sites being crawled when the only url
in my filter list is irs.gov?
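For reference, off-site pages usually get through when the URL filter file in effect ends with a catch-all accept rule.  A minimal sketch of a restrictive filter file follows; the filename (conf/crawl-urlfilter.txt in Nutch 0.7-era setups, vs. regex-urlfilter.txt for the step-by-step tools) and the exact rules shown are assumptions, not your actual config:

```
# skip common binary/asset file types (illustrative list)
-\.(gif|jpg|png|css|js|zip|exe)$

# accept only irs.gov and its subdomains (hypothetical entry)
+^http://([a-z0-9]*\.)*irs\.gov/

# reject everything else.  If this last line were "+." instead,
# every URL would be accepted, which is how links out to
# mapquest.com or maps.google.com end up in the fetchlist.
-.
```

The stock regex-urlfilter.txt shipped with Nutch ends with `+.` (accept everything), so if that file is the one being applied, outbound links will be crawled regardless of what the crawl filter says.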

-----Original Message-----
From: Jeff Ritchie [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 27, 2006 10:17 AM
To: [email protected]
Subject: Re: injecting new urls


Hi Richard,

Under the default settings none of your IRS links should be fetched 
again for 30 days.

Jeff.

Richard Braman wrote:

>I have read the recent posts on this subject but want to confirm my 
>understanding of what I am doing.  I just completed my initial crawl 
>using www.irs.gov.  I now want to add some more tax-related urls and 
>fetch those, but without doing irs.gov again.  I created a newurls file 
>and also put regexps in the url filter.
>also put regexps in the url filter.
> 
>Are these commands what I should do?
> 
>bin/nutch inject db newurls
>bin/nutch generate db segments
>bin/nutch fetch segments/<latest_segment>
> 
>Should I remove irs.gov from the filter so it doesn't get crawled again,
>since some of the other new urls link back to irs.gov?
> 
>Richard Braman
>mailto:[EMAIL PROTECTED]
>561.748.4002 (voice)
>
>http://www.taxcodesoftware.org
>Free Open Source Tax Software
>
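The commands in the quoted message match the Nutch 0.7-era tools.  A sketch of the full cycle is below; the segment-directory glob and the closing updatedb step are assumptions based on the standard whole-web workflow, not something confirmed in this thread:

```
# add the new seed urls to the web db
bin/nutch inject db newurls

# generate a fetchlist segment from the db
bin/nutch generate db segments

# pick the newest segment (assumed timestamp-named layout) and fetch it
s=`ls -d segments/2* | tail -1`
bin/nutch fetch $s

# write fetch results back into the db so newly discovered
# links are eligible for the next generate/fetch round
bin/nutch updatedb db $s
```

Without the updatedb step, links found during the fetch never make it back into the db, and subsequent generate runs won't pick them up.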
