That's an awesome explanation Matt... Thanks :)

----- Original Message -----
From: "Matthew Holt" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, July 11, 2006 1:51 PM
Subject: Re: Adddays confusion - easy question for the experts
> Honda-Search Administrator wrote:
>> Reader's Digest version:
>> How can I ensure that nutch only crawls the urls I inject into the
>> fetchlist and does not recrawl the entire webdb?
>> Can anyone explain to me (in simple terms) exactly what adddays does?
>>
>> Long version:
>> My setup is simple. I crawl a number of internet forums. This requires
>> me to scan new posts every night to stay on top of things.
>>
>> I crawled all of the older posts on these forums a while ago, and now
>> only have to worry about newer posts. I have written a small script that
>> injects the new or changed pages each night.
>>
>> When I run the recrawl script, I only want to crawl the pages that are
>> injected into the fetchlist (via bin/nutch inject). I have also changed
>> the default nutch recrawl time interval (normally 30 days) to a VERY
>> large number to ensure that nutch will not recrawl old pages for a very
>> long time.
>>
>> Anyway, back to my original question.
>>
>> I recrawled today hoping that nutch would ONLY recrawl the 3000 documents
>> I injected (via bin/nutch inject). I used a depth of 1 and left the
>> adddays parameter blank (because I really can't get a clear idea of what
>> it does). Depth of 1 is used because I only want to crawl the URLs I have
>> injected into the fetchlist and not have nutch go crazy on other domains,
>> documents, etc. Using the regex-urlfilter I have also ensured that it
>> will only crawl the domains I want it to crawl.
>>
>> So my command looks something like this:
>>
>> /home/nutch/recrawl.sh /home/nutch/database 1
>>
>> My recrawl script can be seen here:
>> http://www.honda-search.com/script.html
>>
>> Much to my surprise, Nutch is recrawling EVERY document in my webdb
>> (plus, I assume, the newly injected documents). Is this because the
>> adddays variable is left blank? Should I set the adddays variable really
>> high? How can I ensure that it only crawls the urls that are injected?
>>
>> Can anyone explain what adddays does (in easy-to-understand terms)? The
>> wiki isn't very clear for a newbie like myself.
>>
> I was looking for similar info. The adddays option advances the clock
> by however many days you specify. The default for page reindexing is 30
> days, so every 30 days the page will expire and nutch will reindex it.
> However, if you pass the param -adddays 31, it will advance the clock 31
> days and cause every page to be reindexed.
>
> If you pass the param -adddays 27 and you have the default reindexing set
> to be 30 days, nutch will reindex all pages older than 3 days. Correct me
> if I'm wrong.
> Matt
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
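
For anyone who wants to check the arithmetic in Matt's explanation, here is a
minimal Python sketch of the rule as he describes it. It is not Nutch code:
the function name is_due and its parameters are made up for illustration, and
treating a page that is exactly at the boundary as due is an assumption.

```python
from datetime import datetime, timedelta

def is_due(last_fetch, now, fetch_interval_days=30, adddays=0):
    """Return True if a page would be picked for refetching.

    Hypothetical model of the rule described in the thread: a page
    expires fetch_interval_days after its last fetch, and adddays
    moves the clock forward before that check is made.
    """
    next_fetch = last_fetch + timedelta(days=fetch_interval_days)
    pretend_now = now + timedelta(days=adddays)
    # Assumption: a page expiring exactly at pretend_now counts as due.
    return next_fetch <= pretend_now

if __name__ == "__main__":
    now = datetime(2006, 7, 11)
    print("age  adddays=0  adddays=27  adddays=31")
    for age_days in (1, 3, 10, 29, 30):
        last = now - timedelta(days=age_days)
        row = [is_due(last, now, adddays=n) for n in (0, 27, 31)]
        print(age_days, *row)
```

With a 30 day interval, -adddays 27 selects pages last fetched 3 or more days
ago, and -adddays 31 selects everything, which matches the two cases above.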
