Gora Mohanty wrote:
Hi,
We are using Nutch to crawl an internal site, and index content
to Solr. The issue is that the site is run through a CMS, and
occasionally pages are deleted, so that the corresponding URLs
become invalid. Is there any way that Nutch can discover stale
URLs during recrawls, or is the only solution a completely fresh
crawl? Also, is it possible to have Nutch automatically remove
such stale content from Solr?
I am stumped by this problem, and would appreciate any pointers,
or even thoughts on this.
Hi,
Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are
kept in the Nutch CrawlDb to prevent their re-discovery (through stale
links pointing to these URLs from other pages). If you really want to
remove them from the CrawlDb, you can filter them out (using CrawlDbMerger
with just one input db, and setting your URLFilters appropriately).
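For example, something along these lines should do it (a sketch only; the
URL pattern below is made up, and you would add one exclusion per deleted
page or path):

  # in conf/regex-urlfilter.txt, exclude the gone pages, e.g.
  -^http://intranet\.example\.com/old-section/

  # then rewrite the CrawlDb through the configured filters
  bin/nutch mergedb crawldb_filtered crawldb -filter

Once you are happy with the result, crawldb_filtered replaces the old
crawldb.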
Now when it comes to removing them from Solr ... The simplest (no
coding) way would be to dump the CrawlDb, use some scripting tools to
collect just the URLs with the status GONE, and send them as a <delete>
command to Solr. A slightly more involved solution would be to implement
a small tool that reads such URLs directly from the CrawlDb (using e.g.
the CrawlDbReader API) and then uses the SolrJ API to send the same delete
requests plus a commit.
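For the first variant, a rough sketch (made-up paths, assuming the default
plain-text dump format and that your Solr uniqueKey is the page URL, as in
the standard Nutch-to-Solr indexing setup):

  # dump the CrawlDb to plain text
  bin/nutch readdb crawldb -dump crawldb_dump

  # the URL line and its "Status:" line are adjacent in the dump, so pair
  # them up and keep only the db_gone ones, wrapped as <id> elements
  awk '/^http/ {url=$1} /db_gone/ {print "<id>" url "</id>"}' \
      crawldb_dump/part-* > gone_ids.txt

  # wrap them in a <delete> command, post to Solr, then commit
  (echo '<delete>'; cat gone_ids.txt; echo '</delete>') > delete.xml
  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
      --data-binary @delete.xml
  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
      --data-binary '<commit/>'

Caveats: URLs containing &, < or > need XML-escaping before being wrapped
in <id>, and older Solr releases may accept only a single <id> per
<delete>, in which case send one delete per URL.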
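And a skeleton of the second variant. This is only a sketch: it reads the
<Text, CrawlDatum> records straight out of the CrawlDb part files with the
Hadoop API instead of going through CrawlDbReader, the crawldb path and
Solr URL come in as arguments, and the SolrJ server class is the old
CommonsHttpSolrServer (newer releases call it HttpSolrServer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

/** Removes documents whose CrawlDb status is db_gone from a Solr index. */
public class SolrDeleteGone {

  public static void main(String[] args) throws Exception {
    // args[0] = path to the crawldb, args[1] = Solr URL
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path current = new Path(args[0], CrawlDb.CURRENT_NAME);
    SolrServer solr = new CommonsHttpSolrServer(args[1]);

    // CrawlDb records are <Text url, CrawlDatum> pairs stored in
    // MapFiles under <crawldb>/current/part-*/data
    for (FileStatus part : fs.listStatus(current)) {
      Path data = new Path(part.getPath(), "data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
          // assumes the Solr uniqueKey field holds the page URL
          solr.deleteById(url.toString());
        }
      }
      reader.close();
    }
    solr.commit();
  }
}

Running something like this after each updatedb pass should keep the index
free of pages the CMS has deleted.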
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com