Reader's Digest version:
How can I ensure that Nutch only crawls the URLs I inject into the fetchlist and does not recrawl the entire webdb?
Can anyone explain to me (in simple terms) exactly what adddays does?

Long version:
My setup is simple: I crawl a number of internet forums, which means I have to scan for new posts every night to stay on top of things.

I crawled all of the older posts on these forums a while ago, so now I only have to worry about newer posts. I have written a small script that, each night, injects the pages that are new or have changed (see the sketch below).
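
For reference, the nightly step boils down to collecting the new/changed URLs in a text file and handing them to bin/nutch inject. A minimal sketch of that idea (the paths and file names are placeholders, and the exact inject arguments depend on the Nutch version, so treat this as an outline rather than my literal script):

#!/bin/bash
# Nightly inject sketch -- paths and file names are placeholders.
NUTCH_HOME=/home/nutch
DB=/home/nutch/database            # the webdb directory
URLFILE=/home/nutch/new_urls.txt   # one URL per line: tonight's new/changed pages

# Inject tonight's URLs into the webdb so they show up in the next fetchlist.
# (On 0.7-style installs inject takes the db plus -urlfile; on 0.8-style
# installs it takes a crawldb and a directory of URL files instead.)
$NUTCH_HOME/bin/nutch inject $DB -urlfile $URLFILE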

When I run the recrawl script, I only want to crawl the pages that were injected into the fetchlist (via bin/nutch inject). I have also changed Nutch's default recrawl interval (normally 30 days) to a VERY large number, to ensure that Nutch will not recrawl old pages for a very long time.
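
In case it matters, the interval change is just an override in conf/nutch-site.xml along these lines (the property name and units can differ between Nutch versions, so take this as a sketch of what I set rather than the exact snippet):

<property>
  <name>db.default.fetch.interval</name>
  <!-- value is in days; set absurdly high so old pages are not due again for a long time -->
  <value>3650</value>
</property>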

Anyway, back to my original question.

I recrawled today hoping that Nutch would ONLY recrawl the 3000 documents I injected (via bin/nutch inject). I used a depth of 1 and left the adddays parameter blank (because I really can't get a clear idea of what it does). A depth of 1 is used because I only want to crawl the URLs I have injected into the fetchlist and not have Nutch go crazy on other domains, documents, etc. Using the regex-urlfilter I have also ensured that it will only crawl the domains I want it to crawl (along the lines of the sketch below).
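
For completeness, the relevant part of my regex-urlfilter.txt looks roughly like this (the domain below is a placeholder, not one of the real forums I crawl):

# allow anything on the forum domains I care about (placeholder domain)
+^http://([a-z0-9-]+\.)*example-forum\.com/

# reject everything else
-.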

So my command looks something like this:

/home/nutch/recrawl.sh /home/nutch/database 1

My recrawl script can be seen here: http://www.honda-search.com/script.html

Much to my surprise, Nutch is recrawling EVERY document in my webdb (plus, I assume, the newly injected documents). Is this because the adddays variable was left blank? Should I set the adddays variable really high? How can I ensure that it only crawls the URLs that are injected?

Can anyone explain what adddays does, in easy-to-understand terms? The wiki isn't very clear for a newbie like myself.
