I haven't tried this (or even thought it through much),
but an easy way to achieve this might be to set
db.default.fetch.interval to an arbitrarily large number
of days (maybe 36500, or about 100 years). Any page that is
fetched will then not be due for a re-fetch for 100 years, so
only newly injected pages will be fetched when you re-crawl.
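
As an untested sketch of that setting: assuming the override goes in
conf/nutch-site.xml (the usual place for local overrides of
nutch-default.xml) and that the value is read as a number of days,
the entry might look something like this:

```xml
<!-- Assumption: placed inside the root element of conf/nutch-site.xml -->
<property>
  <name>db.default.fetch.interval</name>
  <!-- 36500 days, roughly 100 years -->
  <value>36500</value>
</property>
```

The new interval would only take effect for pages whose next fetch time
is set after the change, so pages already in the webDB may still come
due once under the old interval.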

I can't remember if there's code in 0.7 that intelligently
resets the fetch time according to some algorithm.
If there is, this might not work without some code modifications.
The place to look is probably in UpdateDatabaseTool.java,
in the pageContentsChanged and pageContentsUnchanged
methods. You might have to change the calls to
setNextFetchTime according to your needs.

Howie

>I'm having a difficult time configuring nutch to behave the way I want it 
>to behave.
>
>In a nutshell here is my situation:
>
>I crawl a number of forums that relate to Hondas every night for posts.  
>The purpose of my website is to be a search engine for all of the forums at 
>once.
>
>I have a base set of URLs in the webDB right now.  Every day I write a file 
>of URLs (that I place in urls/inject.txt) that I want nutch to inject into 
>the database to crawl.  I do NOT want to recrawl other URLs.  I only want 
>to crawl/recrawl the URLs in my list.
>
>Can you help me configure nutch (or help with the correct scripts, crons, 
>etc.) to do this?  I've tried without success.
>
>I am running nutch 0.7.2 and am totally confused with what to do next.  It 
>seems to me to be a simple fix, but I can't figure it out.
>
>As I mentioned I will pay if someone can set me up.  I've run the crawl a 
>number of times now and I just keep on screwing things up.
>
>Matt
>



_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
