Hi,

We are using Nutch to crawl an internal site and index the content into Solr. The issue is that the site is run through a CMS, and pages are occasionally deleted, so their URLs become invalid. Is there any way for Nutch to discover such stale URLs during recrawls, or is the only solution a completely fresh crawl? Also, is it possible to have Nutch automatically remove the stale content from Solr?
I am stumped by this problem and would appreciate any pointers, or even just thoughts on it.

Regards,
Gora