Gora Mohanty wrote:
> On Mon, 26 Oct 2009 17:26:23 +0100
> Andrzej Bialecki <a...@getopt.org> wrote:
>> [...]
>> Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
>> They are kept in the Nutch CrawlDb to prevent their re-discovery
>> (through stale links pointing to these URLs from other pages).
>> If you really want to remove them from the CrawlDb you can filter
>> them out (using CrawlDbMerger with just one input db, and setting
>> your URLFilters appropriately).
>> [...]
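(As a sketch of that last point, assuming the usual 1.x layout with the CrawlDb under crawl/crawldb, and with a placeholder URL pattern: add an exclude rule for the stale pages to conf/regex-urlfilter.txt, then rewrite the db through the active filters with CrawlDbMerger, giving it a single input db.)

  # conf/regex-urlfilter.txt: reject the URLs you want dropped
  -^http://www\.example\.com/removed/

  # rewrite the CrawlDb through the configured URLFilters
  bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter

If the filtered copy looks right you can then move it into place of the old crawl/crawldb.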
> Thank you for your help. Your suggestions look promising, but I
> think that I did not make myself adequately clear. Once we have
> completed a site crawl with Nutch, ideally I would like to be
> able to find stale links without doing a complete recrawl, i.e.,
> only by restarting the crawl from where it last left off. Is
> that possible?
> I tried a simple test on a local webserver with five pages in a
> three-level hierarchy. The crawl completes, and discovers all
> five URLs as expected. Now, I remove a tertiary page. Ideally,
> I would like to be able to run a recrawl, and have Nutch discover
> the now-missing URL. However, when I try that, it finds no new
> links, and exits.
I assume you mean that the "generate" step produces no new URLs to
fetch? That's expected, because they become eligible for re-fetching
only after Nutch considers them expired, i.e. after fetchTime +
fetchInterval, and the default fetchInterval is 30 days.
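(If I remember the property name correctly, that default comes from db.fetch.interval.default in nutch-default.xml, expressed in seconds, so one alternative to faking the clock is to shorten it for your site. A sketch of the override in conf/nutch-site.xml:)

  <!-- re-fetch pages after 7 days instead of the default 30 -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>604800</value>
  </property>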
You can pretend that time has moved on by using the -adddays parameter
of the generator. Nutch will then generate a new fetchlist, and when
the fetcher discovers that the page is missing it will mark it as gone.
In fact, you could take that information directly from the Nutch
segment: instead of processing the CrawlDb, you could process the
segment to collect a partial list of gone pages.
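Roughly, and assuming the usual crawl/crawldb and crawl/segments layout (the readseg flags and the exact wording of the status in the dump can differ between versions, so treat this as a sketch):

  # make everything due for re-fetch, then fetch and update as usual
  bin/nutch generate crawl/crawldb crawl/segments -adddays 31
  segment=crawl/segments/20091026120000   # substitute the directory generate just created
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment

  # dump only the crawl_fetch data and look for pages marked gone
  bin/nutch readseg -dump $segment segdump \
      -nocontent -nogenerate -noparse -noparsedata -noparsetext
  grep -i -B 5 'gone' segdump/dump

After the updatedb step, bin/nutch readdb crawl/crawldb -stats should also show those pages in the db_gone count, if you only need the numbers.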
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com