Bartosz Gadzimski wrote:
NutchDeveloper wrote:
I use this script to crawl and recrawl the web:
http://wiki.apache.org/nutch/Crawl

I noticed that the database grows very slowly (depth=2, topN=1000, adddays=30)
because it fetches the same URLs several times in different recrawl loops. What should I do to force Nutch to fetch ONLY unfetched URLs from the crawldb?
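Roughly, each pass of that script runs the standard generate/fetch/updatedb cycle; a simplified sketch (my paths, and I have left out injection, parsing and indexing) looks like this:

  depth=2
  topN=1000
  adddays=30

  for ((i=0; i < depth; i++)); do
    # pick the URLs to fetch in this pass
    bin/nutch generate crawl/crawldb crawl/segments -topN $topN -adddays $adddays
    segment=`ls -d crawl/segments/* | tail -1`
    # fetch them and feed the results back into the crawldb
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment
  done
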
You can fetch unfetched URLs, as well as those that have expired, when you don't use the -topN switch.

-topN selects only the URLs with the highest scores.
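
For example (paths here are just an example), you can check how much of the crawldb is still unfetched and then generate without -topN, so that everything that is due gets selected:

  # db_unfetched in the stats output shows how many URLs were never fetched
  bin/nutch readdb crawl/crawldb -stats

  # no -topN: selects all URLs that are due, not only the 1000 best-scoring ones
  bin/nutch generate crawl/crawldb crawl/segments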

While this is true, the reason it works this way for the original poster is the adddays=30 parameter: essentially, this setting tells Nutch to expire all already fetched pages (i.e. treat them as unfetched again), which is what you're seeing. During normal operation you should NOT use this parameter.
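
To illustrate (paths and values below are just an example): the shipped default for db.fetch.interval.default is 30 days, so -adddays 30 makes the generator pretend the current time lies one full interval in the future, and every previously fetched page becomes due again:

  # the default refetch interval (2592000 seconds = 30 days) is defined here
  grep -A1 'db.fetch.interval.default' conf/nutch-default.xml

  # with -adddays 30 the generator treats "now + 30 days" as the current time,
  # so pages fetched at any point in the past look due again
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 30

  # for a normal recrawl, drop -adddays and let the fetch interval decide
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000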

So, Nutch appears to be working correctly ;)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
