I haven't tried this (or even thought it through much), but an easy
way to achieve this would be to set the db.default.fetch.interval
property to an arbitrarily large value, say 36500 days (about 100
years). Pages that have already been fetched then won't be due for
a re-fetch for 100 years, so only newly injected pages will be
fetched when you re-crawl.
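If you go that route, the override belongs in conf/nutch-site.xml
(which overrides conf/nutch-default.xml). Untested, but it should
look something like this -- the value is in days, if I remember
right, so 36500 days is roughly 100 years:

    <nutch-conf>
      <property>
        <name>db.default.fetch.interval</name>
        <!-- interval is in days; 36500 days is about 100 years -->
        <value>36500</value>
      </property>
    </nutch-conf>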
I can't remember if there's code in 0.7 that intelligently
resets the fetch time according to some algorithm.
If there is, this might not work without some code modifications.
The place to look is probably in UpdateDatabaseTool.java,
in the pageContentsChanged and pageContentsUnchanged
methods. You might have to change the calls to
setNextFetchTime according to your needs.
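To make that concrete, here is an untested illustration -- not the
actual 0.7 source, just the arithmetic involved -- of the kind of
far-future timestamp you would pass to setNextFetchTime() in those
methods:

    // Untested sketch; page.setNextFetchTime(...) is the call named
    // above, the rest is only here to make the snippet compile.
    public class FarFutureFetch {
        // roughly 100 years in milliseconds (365-day years)
        static final long HUNDRED_YEARS_MS =
            100L * 365L * 24L * 60L * 60L * 1000L;

        public static void main(String[] args) {
            long nextFetch = System.currentTimeMillis() + HUNDRED_YEARS_MS;
            // Inside pageContentsChanged/pageContentsUnchanged the call
            // would look something like: page.setNextFetchTime(nextFetch);
            System.out.println("park re-fetches until: "
                + new java.util.Date(nextFetch));
        }
    }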
Howie
I'm having a difficult time configuring Nutch to behave the way I
want it to. In a nutshell, here is my situation:
Every night I crawl a number of Honda-related forums for new posts.
The purpose of my website is to be a search engine for all of those
forums at once.
I have a base set of URLs in the WebDB right now. Every day I write
a file of URLs (which I place in urls/inject.txt) that I want Nutch
to inject into the database and crawl. I do NOT want to recrawl any
other URLs; I only want to crawl/recrawl the URLs in my list.
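For reference, the nightly cycle I have been running looks roughly
like this (0.7-style commands from the whole-web tutorial; "db" and
"segments" are just what my webdb and segments directories are
called, and I may have some of the flags slightly wrong):

    # inject the day's new URLs into the existing webdb
    bin/nutch inject db -urlfile urls/inject.txt

    # generate a fetchlist, fetch it, and fold the results back in
    bin/nutch generate db segments
    s=`ls -d segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch updatedb db $s
    bin/nutch index $s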
Can you help me configure Nutch (or help with the correct scripts,
crons, etc.) to do this? I've tried without success.
I am running Nutch 0.7.2 and am totally confused about what to do
next. It seems like it should be a simple fix, but I can't figure
it out.
As I mentioned, I will pay if someone can set me up. I've run the
crawl a number of times now and I just keep screwing things up.
Matt