I think I found a similar thread here:
http://mail-archives.apache.org/mod_mbox/incubator-nutch-user/200503.mbox/[EMAIL PROTECTED]
The upshot was:
... use the commands described in the internet crawling tutorial.
http://incubator.apache.org/nutch/tutorial.html#Whole-web+Crawling
Hi,
I've followed the instructions to set up an intranet search engine, but
I'm wondering how to update it with new pages. Do I just have to rerun the
crawl every day, or can I use nutch update in some way?
Also, I've set the following property in nutch-site.xml:
<property>
<name>db.default.fetch.interval</name>
<value>1</value>
<description>The default number of days between re-fetches of a page.
</description>
</property>
Am I right in thinking this configures nutch to check that the pages it
already knows about are still valid, and to take them out if not?
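To make the question concrete, here is how I read the description of
db.default.fetch.interval, as a minimal Python sketch. This is not Nutch's
own code: the function name is_due_for_refetch and its arguments are
hypothetical, and the sketch only covers when a page becomes due for a
re-fetch. It says nothing about whether dead pages get removed.

```python
from datetime import datetime, timedelta


def is_due_for_refetch(last_fetched: datetime, now: datetime,
                       interval_days: int = 1) -> bool:
    """Return True once at least interval_days have passed since the last fetch.

    interval_days stands in for db.default.fetch.interval (assumed to be
    a whole number of days, as in the property description).
    """
    return now - last_fetched >= timedelta(days=interval_days)


last = datetime(2005, 3, 1, 12, 0)
print(is_due_for_refetch(last, datetime(2005, 3, 2, 11, 0)))  # 23 hours later: False
print(is_due_for_refetch(last, datetime(2005, 3, 2, 12, 0)))  # 24 hours later: True
```

With a value of 1, every known page would be due again a day after it was
last fetched, if this reading is right.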
Thanks for any help.
JS.