I have read the recent posts on this subject but want to confirm my
understanding of what I am doing.  I just completed my initial crawl
of www.irs.gov.  I now want to add some more tax-related URLs and fetch
those, but without crawling irs.gov again.
I created a newurls file and also added regexps to the URL filter.
 
Are these the commands I should run?
 
bin/nutch inject db newurls
bin/nutch generate db segments
bin/nutch fetch segments/<latest_segment>
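
For the <latest_segment> placeholder, I am assuming I can pick the newest
directory under segments/ by name, roughly like the sketch below. The helper
name latest_segment is my own, and it assumes segment directories are named
by timestamp, so that a lexical sort is also chronological.

```shell
# Sketch only: print the newest segment directory under a segments dir.
# Assumption: segment directory names are timestamps, so sorting the
# names lexically also sorts them by creation time.
latest_segment() {
  # $1 is the segments directory; defaults to "segments" if omitted.
  ls -d "${1:-segments}"/* 2>/dev/null | sort | tail -n 1
}

# Intended use (assumes db/ and segments/ sit in the current directory):
#   bin/nutch fetch "$(latest_segment segments)"
```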
 
Should I remove irs.gov from the filter so that it doesn't get fetched
again, since some of the new URLs link back to irs.gov?
 
 

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice) 

http://www.taxcodesoftware.org
Free Open Source Tax Software

 
