Hi,

We are using Nutch to crawl an internal site and index the content into Solr. The issue is that the site is run through a CMS, and pages are occasionally deleted, so their URLs become invalid. Is there any way for Nutch to discover such stale URLs during recrawls, or is the only solution a completely fresh crawl? Also, is it possible to have Nutch automatically remove the stale content from Solr?
I am stumped by this problem and would appreciate any pointers, or even just thoughts on it.

Regards,
Gora