Hi Richard,
Under the default settings, none of your IRS links should be fetched
again for 30 days.
Jeff.
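
To make the reply above concrete, here is a rough shell sketch of the kind
of check that keeps recently fetched pages out of a new fetch list. It is
only an illustration and not Nutch's actual code: the is_due function and
the variable names are hypothetical, and the only fact taken from the
thread is the 30-day default. In Nutch releases of this era the interval
is, as far as I know, set by db.default.fetch.interval in
nutch-default.xml, but check your own version.

```shell
#!/bin/sh
# Illustrative sketch only; this is not Nutch's actual scheduling code.
# Assumption: a page becomes eligible for refetch once the interval
# (30 days by default, per the reply above) has elapsed since its last fetch.

FETCH_INTERVAL_DAYS=30          # assumed default interval, in days
SECONDS_PER_DAY=86400

# is_due LAST_FETCH_EPOCH NOW_EPOCH
# Succeeds (exit 0) if the page is due for refetch, fails (exit 1) otherwise.
is_due() {
    last_fetch=$1
    now=$2
    next_fetch=$(( last_fetch + FETCH_INTERVAL_DAYS * SECONDS_PER_DAY ))
    [ "$now" -ge "$next_fetch" ]
}

# Example: a page fetched 10 days ago is inside the 30-day window.
now=$(date +%s)
ten_days_ago=$(( now - 10 * SECONDS_PER_DAY ))
if is_due "$ten_days_ago" "$now"; then
    echo "due for refetch"
else
    echo "not due yet"
fi
```

Under that assumption, irs.gov pages fetched in the initial crawl would be
skipped by a generate run made shortly afterwards, even if the new URLs
link back to them.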
Richard Braman wrote:
I have read the recent posts on this subject but want to confirm my
understanding of what I am doing. I just completed my initial crawl
using www.irs.gov. I now want to add some more tax-related URLs and fetch
those, but without doing irs.gov again.
I created a newurls file and also put regexps in the URL filter.
Are these commands what I should do?
bin/nutch inject db newurls
bin/nutch generate db segments
bin/nutch fetch segments/<latest_segment>
Should I remove irs.gov from the filter so that it doesn't get fetched
again, since some of the other new URLs link back to irs.gov?
Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software