Jorg Heymans wrote:
> Hi,
>
> I was wondering if it's possible to get crawl to go through a website and
> only report links that return a specific HTTP response code (e.g. 404)?
> I'm looking to somehow automate basic site testing of rather huge
> websites; inevitably one ends up in the world of crawlers (and being a
> Java guy myself, this means Nutch).
>
> I'm still going through the FAQ and first basic steps, so apologies if
> what I'm asking is the most basic Nutch thing ever :)
I haven't used it yet, but I guess that's what the "store" setting for the fetcher in nutch-config might be for. To my understanding, this would allow you not to store the fetched content but only crawl the links. From the crawldb I guess you should (somehow) be able to see for which URLs the retries were unsuccessful, etc. Maybe you could instead just monitor the output of the fetcher while it runs? Would be nice to hear if you manage to set up a working solution, imho.

Regards,
Stefan
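One low-tech way to act on the second suggestion (watching the fetcher output) would be to pipe the log through a small filter. The sketch below is a standalone illustration, not actual Nutch code: the log-line format and the "failed with" wording are assumptions, so the regex would need adapting to whatever your fetcher version actually prints.

```python
import re

# Hypothetical fetcher log format -- adapt this regex to the real output.
LOG_LINE = re.compile(
    r"fetch of (?P<url>\S+) failed with: http status (?P<code>\d{3})",
    re.IGNORECASE,
)

def find_links_with_status(log_lines, wanted_code=404):
    """Return the URLs whose fetch ended with the given HTTP status code."""
    hits = []
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m and int(m.group("code")) == wanted_code:
            hits.append(m.group("url"))
    return hits

# Example with made-up log output:
sample = [
    "fetching http://example.com/index.html",
    "fetch of http://example.com/missing.html failed with: http status 404",
    "fetch of http://example.com/secret.html failed with: http status 403",
]
print(find_links_with_status(sample))  # -> ['http://example.com/missing.html']
```

Running the fetcher with its output teed into a script like this would give you the 404 report without touching the stored content at all.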
