Reader's Digest version: How can I ensure that Nutch crawls only the URLs I inject into the fetchlist and does not recrawl the entire webdb? Can anyone explain to me (in simple terms) exactly what adddays does?
Long version: My setup is simple. I crawl a number of internet forums, which requires scanning for new posts every night to stay on top of things. I crawled all of the older posts on these forums a while ago, so now I only have to worry about newer posts. I have written a small script that injects the new or changed pages each night. When I run the recrawl script, I only want to crawl the pages that were injected into the fetchlist (via bin/nutch inject). I have also changed the default Nutch recrawl interval (normally 30 days) to a VERY large number to ensure that Nutch will not recrawl old pages for a very long time.

Anyway, back to my original question. I recrawled today hoping that Nutch would ONLY recrawl the 3000 documents I injected (via bin/nutch inject). I used a depth of 1 and left the adddays parameter blank (because I really can't get a clear idea of what it does). I used a depth of 1 because I only want to crawl the URLs I have injected into the fetchlist and not have Nutch go crazy on other domains, documents, etc. Using the regex-urlfilter, I have also ensured that it will only crawl the domains I want it to crawl.

So my command looks something like this:

/home/nutch/recrawl.sh /home/nutch/database 1

My recrawl script can be seen here: http://www.honda-search.com/script.html

Much to my surprise, Nutch is recrawling EVERY document in my webdb (plus, I assume, the newly injected documents). Is this because the adddays variable is left blank? Should I set the adddays variable really high? How can I ensure that it only crawls the URLs that are injected? Can anyone explain what adddays does (in easy-to-understand terms)? The wiki isn't very clear for a newbie like myself.
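To make my expectation concrete, here is a small Python sketch of the selection rule I am assuming. It is only my mental model, not Nutch's actual code: the Page class, the select_due function, the example URLs, the dates and the 3650-day interval are all made up for illustration. My guess is that adddays simply shifts "now" forward by that many days when deciding which pages are due.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Page:
    url: str
    next_fetch: datetime  # earliest time the page is considered due again


def select_due(pages, now, add_days=0):
    """Return URLs whose next fetch time is not later than now + add_days.

    Assumption: this is how I *think* the fetchlist is generated; it is a
    model of my expectation, not a description of Nutch internals.
    """
    cutoff = now + timedelta(days=add_days)
    return [p.url for p in pages if p.next_fetch <= cutoff]


# Arbitrary reference date and a hypothetical "very large" recrawl interval.
now = datetime(2000, 1, 1)
interval = timedelta(days=3650)

pages = [
    # Fetched 10 days ago, so not due again for a very long time.
    Page("http://forum.example.com/old-thread", now - timedelta(days=10) + interval),
    # Freshly injected, so due immediately.
    Page("http://forum.example.com/new-thread", now),
]

due_default = select_due(pages, now)                 # adddays left blank
due_shifted = select_due(pages, now, add_days=4000)  # adddays set very high

print(due_default)
print(due_shifted)
```

If this model were right, leaving adddays blank would select only the injected page, and a very high adddays would pull in the old pages as well. That is the opposite of what I am seeing, so either my model is wrong or something else in my setup is marking the old pages as due.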
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
