Hi, I am trying to solve a scheduling problem, but I cannot find any feature in Nutch that addresses it.
Let's say my intranet has 1000 sites. Sites 1 to 100 have pages that are never going to change, i.e. they are static, so I don't need to crawl them again and again; however, new pages may be added to these sites. Sites 101 to 500 have fairly dynamic content that I expect to change significantly about every 7 days. Sites 501 to 1000 are very dynamic: almost any page can change on any given day. How can I set up recrawls so that:

1) the existing pages of the first group (sites 1-100) are not re-crawled, but any new pages that come up on those sites are crawled;
2) all pages of the second group are re-crawled at an interval of 7 days;
3) all pages of the third group are re-crawled every day;
4) any new URLs injected into the crawl db during a recrawl are crawled as well?

(My attempt so far is sketched below.)
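To show what I have pieced together so far: the injector seems to accept per-URL metadata in the seed list, including nutch.fetchInterval (in seconds), so I sketched one seed file per group along these lines. The URLs here are hypothetical and the fields are tab-separated; I am not sure this is the intended mechanism:

    # group 1 (static): a very large interval as my stand-in for "never re-fetch"
    http://site001.intranet/	nutch.fetchInterval=315360000
    # group 2: re-fetch roughly every 7 days (604800 seconds)
    http://site101.intranet/	nutch.fetchInterval=604800
    # group 3: re-fetch every day (86400 seconds)
    http://site501.intranet/	nutch.fetchInterval=86400

I assume I would also have to raise db.fetch.interval.max in nutch-site.xml (the default is 90 days), since my reading of nutch-default.xml is that every page gets re-tried after that period regardless of its own interval.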
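For the recrawl itself, my understanding is that a plain inject/generate/fetch/parse/updatedb cycle, run daily, only generates URLs whose fetch time is due, and that freshly injected URLs are due immediately, which I hope covers point 4. Something like this (the paths are just my own layout):

    # run once a day
    bin/nutch inject crawl/crawldb urls/              # picks up any newly added seed URLs
    bin/nutch generate crawl/crawldb crawl/segments   # selects only URLs whose fetch time is due
    SEGMENT=$(ls -d crawl/segments/2* | tail -1)      # newest segment created by generate
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb crawl/crawldb "$SEGMENT"       # merges fetch results back into the crawl db

What I still don't see is how to satisfy point 1: pages newly discovered on the group-1 sites would, as far as I can tell, get db.fetch.interval.default (30 days) rather than the interval of their site, and I haven't found a setting that assigns fetch intervals per site. Is there a feature I'm missing?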
