Reader's Digest version:
How can I ensure that Nutch only crawls the URLs I inject into the fetchlist
and doesn't recrawl the entire webdb?
Can anyone explain to me (in simple terms) exactly what adddays does?
Long version:
My setup is simple. I crawl a number of internet forums, which requires me
to scan for new posts every night to stay on top of things.
I crawled all of the older posts on these forums a while ago, so now I only
have to worry about newer posts. Each night a small script of mine injects
the pages that are new or have changed.
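For reference, the inject step in that script boils down to something like
this (Nutch 0.7-style syntax, paths trimmed; new_urls.txt is just a stand-in
for the list the script builds each night):

  bin/nutch inject /home/nutch/database/db -urlfile new_urls.txt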
When I run the recrawl script, I want it to fetch only the pages that were
injected into the fetchlist (via bin/nutch inject). I have also changed the
default Nutch recrawl interval (normally 30 days) to a VERY large number to
ensure that Nutch will not recrawl old pages for a very long time.
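For what it's worth, that interval change is just a property override in
conf/nutch-site.xml, along these lines (property name as in Nutch 0.7.x; the
value is an arbitrary huge number of days):

  <property>
    <name>db.default.fetch.interval</name>
    <value>3650</value>
    <description>Days before a page is re-fetched; set very high so old
    pages are left alone.</description>
  </property>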
Anyway, back to my original question.
I recrawled today hoping that Nutch would ONLY recrawl the 3000 documents I
injected (via bin/nutch inject). I used a depth of 1 and left the adddays
parameter blank (because I really can't get a clear idea of what it does).
The depth of 1 is there because I only want to crawl the URLs I have
injected into the fetchlist, not have Nutch go crazy on other domains,
documents, etc. Using the regex-urlfilter I have also ensured that it will
only crawl the domains I want it to crawl.
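My regex-urlfilter.txt is essentially the stock file cut down to a
whitelist, roughly like this (example.com stands in for my actual forum
domains; the first matching +/- pattern wins, and the final -. rejects
everything else):

  # allow only my forums
  +^http://www\.example\.com/forums/
  # skip everything else
  -.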
So my command looks something like this:
/home/nutch/recrawl.sh /home/nutch/database 1
My recrawl script can be seen here: http://www.honda-search.com/script.html
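(For completeness: the script takes adddays as an optional third argument,
so an explicit run would look like

  /home/nutch/recrawl.sh /home/nutch/database 1 0

except that I have been leaving that last value off.)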
Much to my surprise, Nutch is recrawling EVERY document in my webdb (plus,
I assume, the newly injected documents). Is this because the adddays
variable is left blank? Should I set adddays really high? How can I ensure
that it only crawls the URLs that are injected?
Can anyone explain what adddays does (in easy-to-understand terms)? The
wiki isn't very clear for a newbie like me.