Reader's Digest version:
How can I ensure that Nutch crawls only the URLs I inject into the fetchlist 
and does not recrawl the entire webdb?
Can anyone explain to me (in simple terms) exactly what adddays does?

Long version:
My setup is simple.  I crawl a number of internet forums.  This requires me 
to scan new posts every night to stay on top of things.

I crawled all of the older posts on these forums a while ago, and now have 
to just worry about newer posts.  I have written a small script that injects 
the pages that have changed or the new pages each night.

When I run the recrawl script, I only want to crawl the pages that are 
injected into the fetchlist (via bin/nutch inject).  I have also changed the 
default nutch recrawl time interval (normally 30 days)  to a VERY large 
number to ensure that nutch will not recrawl old pages for a very long time.

Anyway, back to my original question.

I recrawled today hoping that Nutch would ONLY recrawl the 3000 documents I 
injected (via bin/nutch inject).  I used depth of 1 and left the adddays 
parameter blank (because I really can't get a clear idea of what it does). 
Depth of 1 is used because I only want to crawl the URLs I have injected 
into the fetchlist and not have nutch go crazy on other domains, documents, 
etc.  Using the regex-urlfilter I have also ensured that it will only crawl 
the domains I want it to crawl.

So my command looks something like this:

/home/nutch/recrawl.sh /home/nutch/database 1

My recrawl script can be seen here:  http://www.honda-search.com/script.html

Much to my surprise, Nutch is recrawling EVERY document in my webdb (plus, I 
assume, the newly injected documents).  Is this because the adddays variable 
is left blank?  Should I set the adddays variable really high?  How can I 
ensure that it only crawls the URLs that are injected?

Can anyone explain what adddays does (in easy-to-understand terms)?  The 
wiki isn't very clear for a newbie like myself.
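
To make the question concrete, here is my rough mental model of adddays as a 
small shell sketch.  It is only a guess, not taken from the Nutch source: I am 
assuming that each page has a stored "next fetch" time and that adddays simply 
pushes the generator's idea of "now" forward by that many days.  The function 
name is_due and its arguments are made up for illustration.

```shell
#!/bin/sh
# Hypothetical model of the fetchlist decision; NOT actual Nutch code.
# Assumption: a page is selected when its stored next-fetch time is at or
# before "now", and adddays shifts "now" forward by N days.

# is_due NEXT_FETCH NOW [ADDDAYS]
#   NEXT_FETCH  next scheduled fetch time of the page (epoch seconds)
#   NOW         current time (epoch seconds)
#   ADDDAYS     days added to NOW before comparing (defaults to 0)
is_due() {
    next_fetch=$1
    now=$2
    adddays=${3:-0}
    # 86400 seconds per day
    cutoff=$(( now + adddays * 86400 ))
    if [ "$next_fetch" -le "$cutoff" ]; then
        echo due
    else
        echo skip
    fi
}
```

If that model is right, then leaving adddays blank should select only pages 
whose next fetch time has already passed, and a large adddays would select 
MORE pages, not fewer.  For example, "is_due 1864000 1000000" would be a page 
scheduled 10 days ahead, and "is_due 1864000 1000000 31" the same page with 
adddays set to 31.  Can someone confirm or correct this?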

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general