Bartosz Gadzimski wrote:
NutchDeveloper wrote:
I use this script to crawl and recrawl web:
http://wiki.apache.org/nutch/Crawl
I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30)
because it fetches the same urls several times in different recrawl
loops.
What should I do to force Nutch to fetch ONLY unfetched urls from
the crawldb?
You can fetch unfetched urls as well as those that have expired when you
don't use the -topN switch.
-topN selects only the urls with the highest scores.
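For reference, the difference shows up in the generate step. A minimal
sketch, assuming a Nutch 1.x layout with the crawldb at crawl/crawldb and
segments under crawl/segments:

    # selects at most the 1000 highest-scoring urls that are due for fetching
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000

    # without -topN, all urls that are due (unfetched, or past their
    # re-fetch interval) are selected
    bin/nutch generate crawl/crawldb crawl/segments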
While this is true, the reason why it works this way for the original
poster is the adddays=30 parameter - essentially this setting tells
Nutch to expire all already fetched pages (i.e. mark them unfetched),
which is what you're seeing. During normal operation you should NOT use
this parameter.
So, Nutch appears to be working correctly ;)
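For completeness, here is a minimal sketch of one recrawl iteration that
omits -adddays (and -topN), so generate only selects urls that are genuinely
unfetched or past their normal re-fetch interval. The crawl/crawldb and
crawl/segments paths, and the separate parse step (i.e. fetcher.parse=false),
are assumptions based on the wiki script's layout:

    # generate a new segment; without -adddays, already-fetched pages are
    # only re-selected once their fetch interval has actually passed
    bin/nutch generate crawl/crawldb crawl/segments
    segment=crawl/segments/$(ls crawl/segments | sort | tail -1)

    # fetch and parse the segment, then fold the results back into the crawldb
    bin/nutch fetch $segment
    bin/nutch parse $segment
    bin/nutch updatedb crawl/crawldb $segment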
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com