To avoid recrawl to index unchanged content.

pavankumar Sun, 23 Dec 2007 22:03:04 -0800

Hi,
    I am able to crawl successfully using the Nutch 0.9 API by following the
steps described in the documentation. But when I re-crawl, it also indexes
the URLs/files that have not changed. How can I make Nutch index only
the content that has changed? I cannot assume which files/URLs have
changed during a given period, so I need to fetch all of them. But I want
to index only those files/URLs whose content has changed since the
last crawl, so that the re-crawl time is reduced. In fact, my re-crawl is
taking longer than a fresh crawl. How can I reduce the time spent
re-crawling? Is it better to do a fresh crawl every time or to
re-crawl?
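
To make the goal concrete, here is a minimal sketch in plain Java of the
behavior I am after: fetch everything, but only hand a page to the indexer
when its content digest differs from the one recorded on the previous crawl.
This is not Nutch code. The names ChangeDetector, needsReindex and md5Hex
are hypothetical, the in-memory map stands in for whatever store would
persist between crawls, and I am assuming that an MD5 digest of the raw
content is a good enough test for "changed".

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Nutch API.
class ChangeDetector {
    // URL -> hex MD5 digest of the content seen on the previous crawl.
    // Assumption: a real setup would persist this between crawls.
    private final Map<String, String> lastDigest = new HashMap<>();

    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Records the digest for this URL and returns true if the URL is new
    // or its content differs from what was recorded last time.
    boolean needsReindex(String url, byte[] content) {
        String digest = md5Hex(content);
        String previous = lastDigest.put(url, digest);
        return !digest.equals(previous);
    }
}
```

With something like this, a page fetched twice with identical bytes would be
indexed only the first time, and would be indexed again only after its
content changes.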
-- 
View this message in context: 
http://www.nabble.com/To-avoid-recrawl-to-index-unchanged-content.-tp14484900p14484900.html
Sent from the Nutch - User mailing list archive at Nabble.com.
