Jorg Heymans wrote:
> Hi,
>
> I was wondering if it's possible to get crawl to go through a website and
> only report links that return a specific HTTP response code (e.g. 404)?
> I'm looking to somehow automate basic site testing of rather huge
> websites; inevitably one ends up in the world of crawlers (and being a
> Java guy myself, this means Nutch).
>
> I'm still going through the FAQ and first basic steps, so apologies if
> what I'm asking is the most basic Nutch thing ever :)
I haven't used it yet, but I guess that's what the "store" setting for the fetcher in nutch-config might be for. To my understanding, this would allow you not to store the fetched content but only crawl the links. From the crawldb I guess you should (somehow) be able to see for which URLs the retries were unsuccessful, etc. Maybe you could instead just monitor the output of the fetcher while it runs? Would be nice to hear if you manage to set up a working solution, imho.

Regards,
Stefan
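One low-tech way to act on the second suggestion (watching the fetcher output) would be to pipe the log through a small filter. The sketch below is a standalone illustration, not actual Nutch code: the log-line format and the "failed with" wording are assumptions, so the regex would need adapting to whatever your fetcher version actually prints.

```python
import re

# Hypothetical fetcher log format -- adapt this regex to the real output.
LOG_LINE = re.compile(
    r"fetch of (?P<url>\S+) failed with: http status (?P<code>\d{3})",
    re.IGNORECASE,
)

def find_links_with_status(log_lines, wanted_code=404):
    """Return the URLs whose fetch ended with the given HTTP status code."""
    hits = []
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m and int(m.group("code")) == wanted_code:
            hits.append(m.group("url"))
    return hits

# Example with made-up log output:
sample = [
    "fetching http://example.com/index.html",
    "fetch of http://example.com/missing.html failed with: http status 404",
    "fetch of http://example.com/secret.html failed with: http status 403",
]
print(find_links_with_status(sample))  # -> ['http://example.com/missing.html']
```

Running the fetcher with its output teed into a script like this would give you the 404 report without touching the stored content at all.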
