Gora Mohanty wrote:
> On Mon, 26 Oct 2009 17:26:23 +0100
> Andrzej Bialecki <a...@getopt.org> wrote:
>> [...]
>> Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
>> They are kept in the Nutch CrawlDb to prevent their re-discovery
>> (through stale links pointing to these URLs from other pages).
>> If you really want to remove them from the CrawlDb you can filter
>> them out (using CrawlDbMerger with just one input db, and setting
>> your URLFilters appropriately).
>> [...]
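(As a sketch of that last point, assuming the usual 1.x layout with the CrawlDb under crawl/crawldb, and with a placeholder URL pattern: add an exclude rule for the stale pages to conf/regex-urlfilter.txt, then rewrite the db through the active filters with CrawlDbMerger, giving it a single input db.)

  # conf/regex-urlfilter.txt: reject the URLs you want dropped
  -^http://www\.example\.com/removed/

  # rewrite the CrawlDb through the configured URLFilters
  bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter

If the filtered copy looks right you can then move it into place of the old crawl/crawldb.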
> Thank you for your help. Your suggestions look promising, but I
> think that I did not make myself adequately clear. Once we have
> completed a site crawl with Nutch, ideally I would like to be
> able to find stale links without doing a complete recrawl, i.e.,
> only by restarting the crawl from where it last left off. Is
> that possible?
> I tried a simple test on a local webserver with five pages in a
> three-level hierarchy. The crawl completes, and discovers all
> five URLs as expected. Now, I remove a tertiary page. Ideally,
> I would like to be able to run a recrawl, and have Nutch discover
> the now-missing URL. However, when I try that, it finds no new
> links, and exits.
I assume you mean that the "generate" step produces no new URLs to
fetch? That's expected, because they become eligible for re-fetching
only after Nutch considers them expired, i.e. after fetchTime +
fetchInterval, and the default fetchInterval is 30 days.
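(If I remember the property name correctly, that default comes from db.fetch.interval.default in nutch-default.xml, expressed in seconds, so one alternative to faking the clock is to shorten it for your site. A sketch of the override in conf/nutch-site.xml:)

  <!-- re-fetch pages after 7 days instead of the default 30 -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>604800</value>
  </property>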
You can pretend that time has moved on by using the -adddays parameter
of the generator. Nutch will then generate a new fetchlist, and when
the fetcher discovers that the page is missing it will mark it as gone.
In fact, you could take that information directly from the Nutch
segment: instead of processing the CrawlDb, you could process the
segment to collect a partial list of gone pages.
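Roughly, and assuming the usual crawl/crawldb and crawl/segments layout (the readseg flags and the exact wording of the status in the dump can differ between versions, so treat this as a sketch):

  # make everything due for re-fetch, then fetch and update as usual
  bin/nutch generate crawl/crawldb crawl/segments -adddays 31
  segment=crawl/segments/20091026120000   # substitute the directory generate just created
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment

  # dump only the crawl_fetch data and look for pages marked gone
  bin/nutch readseg -dump $segment segdump \
      -nocontent -nogenerate -noparse -noparsedata -noparsetext
  grep -i -B 5 'gone' segdump/dump

After the updatedb step, bin/nutch readdb crawl/crawldb -stats should also show those pages in the db_gone count, if you only need the numbers.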
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com