I get tons of files being referenced that include other pages, for example http://www.xyz.com/corporate/../more.asp. This should theoretically be a 404 Not Found error; however, I've found one or two sites that have a custom 404 page which Nutch doesn't pick up as page not found. Is there any way I can set somewhere in Nutch's config that if certain words exist in the title or on the page (e.g. ERROR 404, PAGE NOT FOUND, or ERROR), Nutch will not index the page?
You could implement an HTML filter plugin that looks for these strings in the title and, when they're found, throws a ParseException to abort the page.
Doug
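
For illustration only, here is a minimal sketch of the title check such a filter might perform. It is deliberately standalone and does not use the Nutch plugin API: the class names `SoftNotFoundFilter` and `SoftNotFoundException` are hypothetical, and the exception is only a stand-in for the ParseException mentioned above. The regex-based title extraction is also an assumption made to keep the example self-contained; a real filter would presumably read the title from the already-parsed document.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the parser's exception type; a real plugin would
// throw whatever exception the Nutch parse API expects instead.
class SoftNotFoundException extends Exception {
    SoftNotFoundException(String message) {
        super(message);
    }
}

// Hypothetical helper, not part of Nutch: detects custom "not found" pages by title.
final class SoftNotFoundFilter {
    // Phrases from the question. The bare word "ERROR" is left out here on
    // purpose (an assumption of this sketch) because it would also match
    // legitimate titles such as "Error handling in Java".
    private static final String[] ERROR_PHRASES = {"error 404", "page not found"};

    // Naive <title> extraction, good enough for a sketch.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty string if there is no <title>.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Case-insensitive substring match against the configured phrases.
    static boolean looksLikeErrorPage(String title) {
        String lower = title.toLowerCase(Locale.ROOT);
        for (String phrase : ERROR_PHRASES) {
            if (lower.contains(phrase)) {
                return true;
            }
        }
        return false;
    }

    // Throws to signal that the page should be abandoned rather than indexed.
    static void check(String url, String html) throws SoftNotFoundException {
        String title = extractTitle(html);
        if (looksLikeErrorPage(title)) {
            throw new SoftNotFoundException(
                    "Custom 404 page at " + url + " (title: \"" + title + "\")");
        }
    }
}
```

Calling `SoftNotFoundFilter.check(url, html)` on a page titled "ERROR 404" would throw, while a page titled "Products" would pass through untouched. Matching on the title alone keeps false positives down; scanning the whole page body for these phrases would likely reject ordinary pages that merely mention errors.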
