Re: [Nutch-general] Adddays confusion - easy question for the experts

Honda-Search Administrator Tue, 11 Jul 2006 15:10:38 -0700

That's an awesome explanation Matt... Thanks :)

----- Original Message ----- 
From: "Matthew Holt" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, July 11, 2006 1:51 PM
Subject: Re: Adddays confusion - easy question for the experts



> Honda-Search Administrator wrote:
>> Reader's Digest version:
>> How can I ensure that nutch only crawls the urls I inject into the 
>> fetchlist and not recrawl the entire webdb?
>> Can anyone explain to me (in simple terms) exactly what adddays does?
>>
>> Long version:
>> My setup is simple.  I crawl a number of internet forums.  This requires 
>> me to scan new posts every night to stay on top of things.
>>
>> I crawled all of the older posts on these forums a while ago, and now 
>> have to just worry about newer posts.  I have written a small script that 
>> injects the pages that have changed or the new pages each night.
>>
>> When I run the recrawl script, I only want to crawl the pages that are 
>> injected into the fetchlist (via bin/nutch inject).  I have also changed 
>> the default nutch recrawl time interval (normally 30 days)  to a VERY 
>> large number to ensure that nutch will not recrawl old pages for a very 
>> long time.
>>
>> Anyway, back to my original question.
>>
>> I recrawled today hoping that nutch would ONLY recrawl the 3000 documents 
>> I injected (via bin/nutch inject).  I used depth of 1 and left the 
>> adddays parameter blank (because I really can't get a clear idea of what 
>> it does). Depth of 1 is used because I only want to crawl the URLs I have 
>> injected into the fetchlist and not have nutch go crazy on other domains, 
>> documents, etc.  Using the regex-urlfilter I have also ensured that it 
>> will only crawl the domains I want it to crawl.
>>
>> So my command looks something like this:
>>
>> /home/nutch/recrawl.sh /home/nutch/database 1
>>
>> my recrawl script can be seen here: 
>> http://www.honda-search.com/script.html
>>
>> Much to my surprise, Nutch is recrawling EVERY document in my webdb 
>> (plus, I assume, the newly injected documents).  Is this because the 
>> adddays variable is left blank?  Should I set the adddays variable really 
>> high?  How can I ensure that it only crawls the urls that are injected?
>>
>> Can anyone explain what adddays does (in easy-to-understand terms)?  The 
>> wiki isn't very clear for a newbie like myself.
>>
> I was looking for similar info. The adddays option advances the clock 
> however many days you specify. The default for page reindexing is 30 days, 
> so every 30 days the page will expire and nutch will reindex it. However, 
> if you pass the param -adddays 31, it will advance the clock 31 days and 
> cause every page to be reindexed.
>
> If you pass the param -adddays 27 and you have the default reindexing set 
> to be 30 days, nutch will reindex all pages older than 3 days. Correct me 
> if I'm wrong.
>  Matt
>
> 
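
The arithmetic in Matt's explanation can be sketched in a few lines of Python. This is only an illustration of the rule as he describes it, not Nutch code: is_due and its parameters are made-up names, and the model (a page is due once its last fetch time plus the refetch interval is no later than "now plus adddays") is an assumption taken from the description above.

```python
from datetime import datetime, timedelta


def is_due(last_fetch, now, interval_days=30, adddays=0):
    """Return True if a page would be picked for refetch under the assumed model."""
    # Assumption: adddays makes the generator act as if the clock were
    # 'adddays' days ahead of the real current time.
    pretend_now = now + timedelta(days=adddays)
    # A page is due once its refetch interval has elapsed on that pretend clock.
    return last_fetch + timedelta(days=interval_days) <= pretend_now


if __name__ == "__main__":
    now = datetime(2006, 7, 11)
    for age_days in (1, 5, 29, 31):
        last_fetch = now - timedelta(days=age_days)
        print(age_days, is_due(last_fetch, now, adddays=27))
```

With the default 30-day interval and adddays set to 27, a page fetched 1 day ago is not due, while pages fetched 5 or more days ago are. With adddays set to 31, even a page fetched today counts as due, which matches the "every page" case above.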



