Bartosz Gadzimski wrote:
NutchDeveloper wrote:
I use this script to crawl and recrawl web:
http://wiki.apache.org/nutch/Crawl
I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30)
because it fetches the same urls several times in different recrawl
loops.
What should I do to force Nutch to fetch ONLY unfetched urls from
the crawldb?
You can fetch unfetched urls as well as those that have expired when you
don't use the -topN switch.
-topN selects only the urls with the highest scores.
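For reference, the difference shows up in the generate step. A minimal
sketch, assuming a Nutch 1.x layout with the crawldb at crawl/crawldb and
segments under crawl/segments:

    # selects at most the 1000 highest-scoring urls that are due for fetching
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000

    # without -topN, all urls that are due (unfetched, or past their
    # re-fetch interval) are selected
    bin/nutch generate crawl/crawldb crawl/segments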
While this is true, the reason why it works this way for the original
poster is the adddays=30 parameter - essentially this setting tells
Nutch to expire all already fetched pages (i.e. mark them unfetched),
which is what you're seeing. During normal operation you should NOT use
this parameter.
So, Nutch appears to be working correctly ;)
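For completeness, here is a minimal sketch of one recrawl iteration that
omits -adddays (and -topN), so generate only selects urls that are genuinely
unfetched or past their normal re-fetch interval. The crawl/crawldb and
crawl/segments paths, and the separate parse step (i.e. fetcher.parse=false),
are assumptions based on the wiki script's layout:

    # generate a new segment; without -adddays, already-fetched pages are
    # only re-selected once their fetch interval has actually passed
    bin/nutch generate crawl/crawldb crawl/segments
    segment=crawl/segments/$(ls crawl/segments | sort | tail -1)

    # fetch and parse the segment, then fold the results back into the crawldb
    bin/nutch fetch $segment
    bin/nutch parse $segment
    bin/nutch updatedb crawl/crawldb $segment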
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com