Reader's Digest version:
How can I ensure that Nutch only crawls the URLs I inject into the fetchlist
and doesn't recrawl the entire webdb?
Can anyone explain to me (in simple terms) exactly what adddays does?
Long version:
My setup is simple. I crawl a number of internet forums, which requires me
to scan for new posts every night to stay on top of things.
I crawled all of the older posts on these forums a while ago, so now I only
have to worry about newer posts. Each night a small script of mine injects
the pages that are new or have changed.
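For reference, the inject step in that script boils down to something like
this (Nutch 0.7-style syntax, paths trimmed; new_urls.txt is just a stand-in
for the list the script builds each night):

  bin/nutch inject /home/nutch/database/db -urlfile new_urls.txt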
When I run the recrawl script, I want it to fetch only the pages that were
injected into the fetchlist (via bin/nutch inject). I have also changed the
default Nutch recrawl interval (normally 30 days) to a VERY large number to
ensure that Nutch will not recrawl old pages for a very long time.
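For what it's worth, that interval change is just a property override in
conf/nutch-site.xml, along these lines (property name as in Nutch 0.7.x; the
value is an arbitrary huge number of days):

  <property>
    <name>db.default.fetch.interval</name>
    <value>3650</value>
    <description>Days before a page is re-fetched; set very high so old
    pages are left alone.</description>
  </property>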
Anyway, back to my original question.
I recrawled today hoping that Nutch would ONLY recrawl the 3000 documents I
injected (via bin/nutch inject). I used a depth of 1 and left the adddays
parameter blank (because I really can't get a clear idea of what it does).
The depth of 1 is there because I only want to crawl the URLs I have
injected into the fetchlist, not have Nutch go crazy on other domains,
documents, etc. Using the regex-urlfilter I have also ensured that it will
only crawl the domains I want it to crawl.
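My regex-urlfilter.txt is essentially the stock file cut down to a
whitelist, roughly like this (example.com stands in for my actual forum
domains; the first matching +/- pattern wins, and the final -. rejects
everything else):

  # allow only my forums
  +^http://www\.example\.com/forums/
  # skip everything else
  -.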
So my command looks something like this:
/home/nutch/recrawl.sh /home/nutch/database 1
My recrawl script can be seen here: http://www.honda-search.com/script.html
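(For completeness: the script takes adddays as an optional third argument,
so an explicit run would look like

  /home/nutch/recrawl.sh /home/nutch/database 1 0

except that I have been leaving that last value off.)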
Much to my surprise, Nutch is recrawling EVERY document in my webdb (plus,
I assume, the newly injected documents). Is this because the adddays
variable is left blank? Should I set adddays really high? How can I ensure
that it only crawls the URLs that are injected?
Can anyone explain what adddays does (in easy-to-understand terms)? The
wiki isn't very clear for a newbie like me.