I haven't tried this (or even thought it through much), but an easy
way to achieve this would be to set the db.default.fetch.interval
property to an arbitrarily large value, say 36500 days (about 100
years). Pages that have already been fetched then won't be due for
a re-fetch for 100 years, so only newly injected pages will be
fetched when you re-crawl.
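If you go that route, the override belongs in conf/nutch-site.xml
(which overrides conf/nutch-default.xml). Untested, but it should
look something like this -- the value is in days, if I remember
right, so 36500 days is roughly 100 years:

    <nutch-conf>
      <property>
        <name>db.default.fetch.interval</name>
        <!-- interval is in days; 36500 days is about 100 years -->
        <value>36500</value>
      </property>
    </nutch-conf>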
I can't remember if there's code in 0.7 that intelligently
resets the fetch time according to some algorithm.
If there is, this might not work without some code modifications.
The place to look is probably in UpdateDatabaseTool.java,
in the pageContentsChanged and pageContentsUnchanged
methods. You might have to change the calls to
setNextFetchTime according to your needs.
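To make that concrete, here is an untested illustration -- not the
actual 0.7 source, just the arithmetic involved -- of the kind of
far-future timestamp you would pass to setNextFetchTime() in those
methods:

    // Untested sketch; page.setNextFetchTime(...) is the call named
    // above, the rest is only here to make the snippet compile.
    public class FarFutureFetch {
        // roughly 100 years in milliseconds (365-day years)
        static final long HUNDRED_YEARS_MS =
            100L * 365L * 24L * 60L * 60L * 1000L;

        public static void main(String[] args) {
            long nextFetch = System.currentTimeMillis() + HUNDRED_YEARS_MS;
            // Inside pageContentsChanged/pageContentsUnchanged the call
            // would look something like: page.setNextFetchTime(nextFetch);
            System.out.println("park re-fetches until: "
                + new java.util.Date(nextFetch));
        }
    }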
Howie
I'm having a difficult time configuring Nutch to behave the way I
want it to. In a nutshell, here is my situation:
Every night I crawl a number of Honda-related forums for new posts.
The purpose of my website is to be a search engine for all of those
forums at once.
I have a base set of URLs in the WebDB right now. Every day I write
a file of URLs (which I place in urls/inject.txt) that I want Nutch
to inject into the database and crawl. I do NOT want to recrawl any
other URLs; I only want to crawl/recrawl the URLs in my list.
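For reference, the nightly cycle I have been running looks roughly
like this (0.7-style commands from the whole-web tutorial; "db" and
"segments" are just what my webdb and segments directories are
called, and I may have some of the flags slightly wrong):

    # inject the day's new URLs into the existing webdb
    bin/nutch inject db -urlfile urls/inject.txt

    # generate a fetchlist, fetch it, and fold the results back in
    bin/nutch generate db segments
    s=`ls -d segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch updatedb db $s
    bin/nutch index $s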
Can you help me configure Nutch (or help with the correct scripts,
crons, etc.) to do this? I've tried without success.
I am running Nutch 0.7.2 and am totally confused about what to do
next. It seems like it should be a simple fix, but I can't figure
it out.
As I mentioned, I will pay if someone can set me up. I've run the
crawl a number of times now and I just keep screwing things up.
Matt