Gora Mohanty wrote:
On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki <a...@getopt.org> wrote:
[...]
Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
They are kept in Nutch crawldb to prevent their re-discovery
(through stale links pointing to these URL-s from other pages).
If you really want to remove them from CrawlDb you can filter
them out (using CrawlDbMerger with just one input db, and setting
your URLFilters appropriately).
[...]

Thank you for your help. Your suggestions look promising, but I
think that I did not make myself adequately clear. Once we have
completed a site crawl with Nutch, ideally I would like to be
able to find stale links without doing a complete recrawl, i.e.,
only through restarting the crawl from where it last left off. Is
that possible?

I tried a simple test on a local webserver with five pages in a
three-level hierarchy. The crawl completes, and discovers all
five URLs as expected. Now, I remove a tertiary page. Ideally,
I would like to be able to run a recrawl, and have Nutch discover
the now-missing URL. However, when I try that, it finds no new
links, and exits.

I assume you mean that the "generate" step produces no new URL-s to fetch? That's expected: already-fetched URL-s become eligible for re-fetching only after Nutch considers them expired, i.e. after fetchTime + fetchInterval, and the default fetchInterval is 30 days.

You can pretend that time has moved on by using the -adddays parameter of the generate step. Nutch will then generate a new fetchlist, and when it discovers that the page is missing it will mark it as gone. You could then take that information directly from the Nutch segment: instead of processing the CrawlDb, you could process the segment to collect a partial list of gone pages.
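
To illustrate the expiry rule and the effect of -adddays described above, here is a minimal sketch in Python. It is a simplified model of the rule as stated in this message, not Nutch's actual code; the function name is_due and its parameters are made up for the example.

```python
from datetime import datetime, timedelta

# Default re-fetch interval mentioned above (30 days).
DEFAULT_FETCH_INTERVAL_DAYS = 30


def is_due(last_fetch, now, interval_days=DEFAULT_FETCH_INTERVAL_DAYS, add_days=0):
    """Return True if a URL would be eligible for re-fetching.

    Hypothetical helper: models "fetchTime + fetchInterval" against the
    current time, shifted forward by add_days to mimic -adddays.
    """
    pretend_now = now + timedelta(days=add_days)
    return pretend_now >= last_fetch + timedelta(days=interval_days)


last_fetch = datetime(2009, 10, 26)
now = datetime(2009, 10, 27)

# One day after the crawl nothing is due, so generate has nothing to emit.
print(is_due(last_fetch, now))
# Pretending that 31 days have passed makes the URL eligible again.
print(is_due(last_fetch, now, add_days=31))
```

In this model, any -adddays value that pushes the pretended time past fetchTime + fetchInterval is enough for the URL to show up in the next fetchlist.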

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
