Hi,

I am trying to solve a scheduling problem, but I am unable to find any
feature in Nutch that addresses it directly.

Let's say there are 1000 sites on my intranet.

Sites 1 to 100 have pages that are never going to change, i.e. they
are static, so I don't need to crawl them again and again. However,
new pages may be added to these sites over time.

Sites 101 to 500 are fairly dynamic; I expect their content to change
significantly about every 7 days.

Sites 501 to 1000 are very dynamic; almost any page can change every
day.

So, how can I schedule recrawls so that:

1) the existing pages of the first group (sites 1-100) are not
re-fetched, but newly added pages on those sites are crawled;

2) all pages of the second group are re-crawled at an interval of 7 days;

3) all pages of the third group are re-crawled every day;

4) any new URLs injected into the crawldb during a recrawl are picked
up as well? (My rough attempts so far are sketched below.)
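Here is what I have pieced together so far; please correct me if I am
off track. I understand the injector can read per-URL metadata from
the seed list as tab-separated name=value pairs, including
nutch.fetchInterval (in seconds) and nutch.fetchInterval.fixed. If
that is right, one seed file per group along these lines might set the
intervals (hostnames and interval values are just placeholders, and
the # lines are only there to label the groups):

    # group 1 (static): fixed one-year interval, so existing pages
    # are effectively never re-fetched
    http://site001.intranet.example/    nutch.fetchInterval.fixed=31536000
    # group 2: re-fetch every 7 days (604800 seconds)
    http://site101.intranet.example/    nutch.fetchInterval=604800
    # group 3: re-fetch every day (86400 seconds)
    http://site501.intranet.example/    nutch.fetchInterval=86400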
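For pages whose change rate I cannot pin down per URL, I also came
across AdaptiveFetchSchedule, which, as I understand it, shortens a
page's interval when the page has changed since the last fetch and
lengthens it when it has not. I was thinking of something like this
in conf/nutch-site.xml (the values are my guesses, not
recommendations):

    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>
    <property>
      <!-- never re-fetch more often than once a day -->
      <name>db.fetch.schedule.adaptive.min_interval</name>
      <value>86400.0</value>
    </property>
    <property>
      <!-- back off toward 90 days for pages that never change -->
      <name>db.fetch.schedule.adaptive.max_interval</name>
      <value>7776000.0</value>
    </property>
    <property>
      <!-- starting interval for newly discovered pages: 7 days -->
      <name>db.fetch.interval.default</name>
      <value>604800</value>
    </property>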
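And for point 4, my understanding is that inject merges new URLs into
an existing crawldb rather than overwriting it, and that generate only
selects URLs whose fetch time is due, so a recrawl loop along these
lines should honour the per-URL intervals and pick up new seeds (a
sketch for a 1.x-style install, assuming fetcher.parse is false):

    # merge any newly added seed URLs into the existing crawldb
    bin/nutch inject crawl/crawldb urls/

    # select only the URLs whose fetch interval has expired
    bin/nutch generate crawl/crawldb crawl/segments
    SEGMENT=`ls -d crawl/segments/2* | tail -1`

    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT

    # fold fetched pages and newly discovered links back into the db
    bin/nutch updatedb crawl/crawldb $SEGMENT

Is this roughly the right direction, or is there a cleaner built-in
way to handle the three groups?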
