Hi, I am trying to solve a scheduling problem, but I cannot find any feature in Nutch that addresses it.
Let's say my intranet has 1000 sites. Sites 1 to 100 have pages that are never going to change, i.e. they are static, so I don't need to crawl them again and again; however, new pages may be added to these sites. Sites 101 to 500 have fairly dynamic content that I expect to change significantly about every 7 days. Sites 501 to 1000 are very dynamic: almost any page can change on any given day. How can I set up recrawls so that:

1) the existing pages of the first group (sites 1-100) are not re-crawled, but any new pages that come up on those sites are crawled;
2) all pages of the second group are re-crawled at an interval of 7 days;
3) all pages of the third group are re-crawled every day;
4) any new URLs injected into the crawl db during a recrawl are crawled as well?

(My attempt so far is sketched below.)
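To show what I have pieced together so far: the injector seems to accept per-URL metadata in the seed list, including nutch.fetchInterval (in seconds), so I sketched one seed file per group along these lines. The URLs here are hypothetical and the fields are tab-separated; I am not sure this is the intended mechanism:

    # group 1 (static): a very large interval as my stand-in for "never re-fetch"
    http://site001.intranet/	nutch.fetchInterval=315360000
    # group 2: re-fetch roughly every 7 days (604800 seconds)
    http://site101.intranet/	nutch.fetchInterval=604800
    # group 3: re-fetch every day (86400 seconds)
    http://site501.intranet/	nutch.fetchInterval=86400

I assume I would also have to raise db.fetch.interval.max in nutch-site.xml (the default is 90 days), since my reading of nutch-default.xml is that every page gets re-tried after that period regardless of its own interval.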
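For the recrawl itself, my understanding is that a plain inject/generate/fetch/parse/updatedb cycle, run daily, only generates URLs whose fetch time is due, and that freshly injected URLs are due immediately, which I hope covers point 4. Something like this (the paths are just my own layout):

    # run once a day
    bin/nutch inject crawl/crawldb urls/              # picks up any newly added seed URLs
    bin/nutch generate crawl/crawldb crawl/segments   # selects only URLs whose fetch time is due
    SEGMENT=$(ls -d crawl/segments/2* | tail -1)      # newest segment created by generate
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb crawl/crawldb "$SEGMENT"       # merges fetch results back into the crawl db

What I still don't see is how to satisfy point 1: pages newly discovered on the group-1 sites would, as far as I can tell, get db.fetch.interval.default (30 days) rather than the interval of their site, and I haven't found a setting that assigns fetch intervals per site. Is there a feature I'm missing?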
