[EMAIL PROTECTED] wrote:
I get tons of files being referenced that include other
pages, for example http://www.xyz.com/corporate/../more.asp
- now, this should theoretically be a 404 Not Found error;
however, I've found 1 or 2 sites that have a custom 404 page
which Nutch doesn't pick up as page not found. Is there
any way that I can set somewhere in Nutch's config that if
certain words exist in the title or on the page (e.g. ERROR
404 or PAGE NOT FOUND or ERROR), Nutch will not index it?

You could implement an HTML filter plugin that looks for these strings in the title and, when they're found, throws a ParseException to abort processing of the page.
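
As a rough illustration of the title check such a filter might perform, here is a minimal, standalone sketch. It deliberately does not use the Nutch plugin API: the class SoftNotFoundCheck, its methods, and SoftNotFoundException are all hypothetical names, with the exception standing in for ParseException so the example compiles on its own. In a real plugin the same check would presumably live inside the filter and be registered through the plugin descriptor. The bare word ERROR from the question is left out of the marker list on the assumption that it would match too many legitimate titles.

```java
import java.util.Locale;

// Standalone sketch; none of these names come from Nutch itself.
class SoftNotFoundCheck {

    // Hypothetical stand-in for ParseException so the sketch is self-contained.
    static class SoftNotFoundException extends Exception {
        SoftNotFoundException(String message) {
            super(message);
        }
    }

    // Marker phrases taken from the question, lower-cased for comparison.
    // Assumption: the bare word "error" is too broad to be a useful marker.
    static final String[] MARKERS = { "error 404", "page not found" };

    // Returns true when the page title contains one of the marker phrases,
    // ignoring case. A null title is treated as "not an error page".
    static boolean looksLikeErrorPage(String title) {
        if (title == null) {
            return false;
        }
        String normalized = title.toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (normalized.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    // Throws when the title looks like a custom 404 page, mirroring the
    // "throw an exception to abort the page" idea from the reply above.
    static void rejectIfErrorPage(String url, String title) throws SoftNotFoundException {
        if (looksLikeErrorPage(title)) {
            throw new SoftNotFoundException("Looks like a custom 404 page: " + url);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.xyz.com/corporate/../more.asp";
        try {
            rejectIfErrorPage(url, "ERROR 404 - Page Not Found");
            System.out.println("Would index: " + url);
        } catch (SoftNotFoundException e) {
            System.out.println("Skipping: " + e.getMessage());
        }
    }
}
```

Matching on the title alone keeps false positives low; checking the page body as well would catch more custom 404 pages, but at the risk of dropping legitimate pages that merely mention the phrase.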


Doug
