Gora Mohanty wrote:
Hi,
We are using Nutch to crawl an internal site, and index content
to Solr. The issue is that the site is run through a CMS, and
occasionally pages are deleted, so that the corresponding URLs
become invalid. Is there any way that Nutch can discover stale
URLs during recrawls, or is the only solution a completely fresh
crawl? Also, is it possible to have Nutch automatically remove
such stale content from Solr?
I am stumped by this problem, and would appreciate any pointers,
or even thoughts on this.
Hi,
Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are
kept in the Nutch CrawlDb to prevent their re-discovery (through stale
links pointing to these URLs from other pages). If you really want to
remove them from the CrawlDb, you can filter them out (using CrawlDbMerger
with just one input db, and setting your URLFilters appropriately).
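For example, something along these lines should do it (a sketch only; the
URL pattern below is made up, and you would add one exclusion per deleted
page or path):

  # in conf/regex-urlfilter.txt, exclude the gone pages, e.g.
  -^http://intranet\.example\.com/old-section/

  # then rewrite the CrawlDb through the configured filters
  bin/nutch mergedb crawldb_filtered crawldb -filter

Once you are happy with the result, crawldb_filtered replaces the old
crawldb.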
Now when it comes to removing them from Solr ... The simplest (no
coding) way would be to dump the CrawlDb, use some scripting tools to
collect just the URLs with the status GONE, and send them as a <delete>
command to Solr. A slightly more involved solution would be to implement
a small tool that reads such URLs directly from the CrawlDb (using e.g.
the CrawlDbReader API) and then uses the SolrJ API to send the same delete
requests plus a commit.
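For the first variant, a rough sketch (made-up paths, assuming the default
plain-text dump format and that your Solr uniqueKey is the page URL, as in
the standard Nutch-to-Solr indexing setup):

  # dump the CrawlDb to plain text
  bin/nutch readdb crawldb -dump crawldb_dump

  # the URL line and its "Status:" line are adjacent in the dump, so pair
  # them up and keep only the db_gone ones, wrapped as <id> elements
  awk '/^http/ {url=$1} /db_gone/ {print "<id>" url "</id>"}' \
      crawldb_dump/part-* > gone_ids.txt

  # wrap them in a <delete> command, post to Solr, then commit
  (echo '<delete>'; cat gone_ids.txt; echo '</delete>') > delete.xml
  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
      --data-binary @delete.xml
  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
      --data-binary '<commit/>'

Caveats: URLs containing &, < or > need XML-escaping before being wrapped
in <id>, and older Solr releases may accept only a single <id> per
<delete>, in which case send one delete per URL.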
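And a skeleton of the second variant. This is only a sketch: it reads the
<Text, CrawlDatum> records straight out of the CrawlDb part files with the
Hadoop API instead of going through CrawlDbReader, the crawldb path and
Solr URL come in as arguments, and the SolrJ server class is the old
CommonsHttpSolrServer (newer releases call it HttpSolrServer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

/** Removes documents whose CrawlDb status is db_gone from a Solr index. */
public class SolrDeleteGone {

  public static void main(String[] args) throws Exception {
    // args[0] = path to the crawldb, args[1] = Solr URL
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path current = new Path(args[0], CrawlDb.CURRENT_NAME);
    SolrServer solr = new CommonsHttpSolrServer(args[1]);

    // CrawlDb records are <Text url, CrawlDatum> pairs stored in
    // MapFiles under <crawldb>/current/part-*/data
    for (FileStatus part : fs.listStatus(current)) {
      Path data = new Path(part.getPath(), "data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
          // assumes the Solr uniqueKey field holds the page URL
          solr.deleteById(url.toString());
        }
      }
      reader.close();
    }
    solr.commit();
  }
}

Running something like this after each updatedb pass should keep the index
free of pages the CMS has deleted.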
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com