Hi Richard,

Under the default settings, none of your IRS links should be fetched again for 30 days.
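
To make the 30-day rule concrete, here is a minimal sketch of the kind of check involved. It is not Nutch's actual code: the helper name `is_due`, its parameters, and the dates are all hypothetical, and only the 30-day figure comes from the default mentioned above.

```python
from datetime import datetime, timedelta

# Hypothetical helper, not part of Nutch: a page becomes eligible for
# refetching only once its refetch interval has fully elapsed.
def is_due(last_fetched: datetime, now: datetime, interval_days: int = 30) -> bool:
    return now >= last_fetched + timedelta(days=interval_days)

crawl_day = datetime(2006, 3, 1)  # illustrative date, assumed for the example

# One week after the initial crawl the page is not yet due.
print(is_due(crawl_day, crawl_day + timedelta(days=7)))   # False

# Once the full 30 days have passed it becomes eligible again.
print(is_due(crawl_day, crawl_day + timedelta(days=30)))  # True
```

Under that assumption, pages fetched in the initial crawl would be skipped by a new generate/fetch cycle run within the interval, even if newly added pages link back to them.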

Jeff.

Richard Braman wrote:

I have read the recent posts on this subject but want to confirm my
understanding of what I am doing.  I just completed my initial crawl
using www.irs.gov.  I now want to add some more tax-related URLs and fetch
those, but without fetching irs.gov again.
I created a newurls file and also put regexps in the URL filter.

Are these the commands I should run?

bin/nutch inject db newurls
bin/nutch generate db segments
bin/nutch fetch segments/<latest_segment>

Should I remove irs.gov from the filter so that it doesn't get fetched
again, since some of the new URLs link back to irs.gov?



Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software



