That's an awesome explanation Matt... Thanks :)
----- Original Message -----
From: "Matthew Holt" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, July 11, 2006 1:51 PM
Subject: Re: Adddays confusion - easy question for the experts
Honda-Search Administrator wrote:
Reader's Digest version:
How can I ensure that Nutch only crawls the URLs I inject into the
fetchlist and doesn't recrawl the entire webdb?
Can anyone explain to me (in simple terms) exactly what adddays does?
Long version:
My setup is simple. I crawl a number of internet forums. This requires
me to scan new posts every night to stay on top of things.
I crawled all of the older posts on these forums a while ago, and now
just have to worry about newer posts. I have written a small script that
injects the new or changed pages each night.
When I run the recrawl script, I only want to crawl the pages that are
injected into the fetchlist (via bin/nutch inject). I have also changed
the default nutch recrawl time interval (normally 30 days) to a VERY
large number to ensure that nutch will not recrawl old pages for a very
long time.
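For reference, the interval override lives in conf/nutch-site.xml. Assuming a Nutch 0.7-era config (the property name and units may differ in other versions, so check your nutch-default.xml), it looks roughly like this:

```xml
<!-- conf/nutch-site.xml: override the default refetch interval.
     Property name assumed from the Nutch 0.7 line; verify against your
     version's nutch-default.xml. Value is in days. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>3650</value> <!-- ~10 years, i.e. "never recrawl on its own" -->
</property>
```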
Anyway, back to my original question.
I recrawled today hoping that Nutch would ONLY recrawl the 3000 documents
I injected (via bin/nutch inject). I used a depth of 1 and left the
adddays parameter blank (because I really can't get a clear idea of what
it does). A depth of 1 is used because I only want to crawl the URLs I have
injected into the fetchlist and not have Nutch go crazy on other domains,
documents, etc. Using the regex-urlfilter I have also ensured that it
will only crawl the domains I want it to crawl.
So my command looks something like this:
/home/nutch/recrawl.sh /home/nutch/database 1
My recrawl script can be seen here:
http://www.honda-search.com/script.html
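For anyone who doesn't want to follow the link: recrawl scripts from this era typically chain the generate/fetch/updatedb/index steps. Here is a dry-run sketch of that shape; it only echoes the commands, and the paths, segment naming, and defaults are placeholders of mine, not the actual recrawl.sh:

```shell
#!/bin/sh
# Dry-run sketch of a typical Nutch 0.7-era recrawl loop. It prints the
# commands instead of running them; DB path, segments dir, and the
# <newest-segment> placeholder are hypothetical, not from the real script.
DB=/home/nutch/database/db
SEGMENTS=/home/nutch/database/segments
DEPTH=${1:-1}
ADDDAYS=${2:-0}

i=1
while [ "$i" -le "$DEPTH" ]; do
  # generate selects the pages that are due; -adddays shifts "now" forward
  echo bin/nutch generate "$DB" "$SEGMENTS" -adddays "$ADDDAYS"
  echo bin/nutch fetch "$SEGMENTS/<newest-segment>"
  echo bin/nutch updatedb "$DB" "$SEGMENTS/<newest-segment>"
  i=$((i + 1))
done
echo bin/nutch index "$SEGMENTS/<newest-segment>"
```

With a depth of 1 the loop runs once, which matches the command shown above.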
Much to my surprise, Nutch is recrawling EVERY document in my webdb
(plus, I assume, the newly injected documents). Is this because the
adddays variable is left blank? Should I set the adddays variable really
high? How can I ensure that it only crawls the URLs that are injected?
Can anyone explain what adddays does (in easy-to-understand terms)? The
wiki isn't very clear for a newbie like myself.
I was looking for similar info. The adddays option advances the clock by
however many days you specify. The default fetch interval for a page is 30
days, so every 30 days the page expires and Nutch will reindex it. If you
pass the param -adddays 31, it advances the clock 31 days and causes every
page to be reindexed.
If you pass the param -adddays 27 with the default 30-day interval, Nutch
will reindex every page last fetched more than 3 days ago. Correct me
if I'm wrong.
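Put as arithmetic, the rule described above is: a page becomes due when its age plus adddays reaches the fetch interval, i.e. pages older than (interval - adddays) days get selected. A tiny shell sketch of that rule, using the 30/27 example from this thread (this is an illustration, not the actual Nutch source):

```shell
#!/bin/sh
# Sketch of the selection rule as described above (not Nutch code):
# a page is due for refetch when  page_age + adddays >= interval.
interval=30   # default fetch interval, in days
adddays=27    # value passed as -adddays 27
threshold=$((interval - adddays))   # pages older than this are selected
echo "pages last fetched more than $threshold days ago will be refetched"

page_age=5    # hypothetical page last fetched 5 days ago
if [ $((page_age + adddays)) -ge "$interval" ]; then
  result=refetch   # 5 + 27 = 32 >= 30, so this page is selected
else
  result=skip
fi
echo "page_age=$page_age: $result"
```

So with -adddays 27 a 5-day-old page is refetched, while leaving adddays at 0 would skip it until day 30.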
Matt