Would anyone be able to help me with this or provide a
sample of how to do this kind of thing?

Thanks

On Mon, 18 Apr 2005 10:38:11 -0700
 Doug Cutting <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > I get tons of files being referenced that include other pages,
> > for example http://www.xyz.com/corporate/../more.asp. This should
> > theoretically be a 404 Not Found error; however, I've found one
> > or two sites that have a custom 404 page, which Nutch doesn't
> > pick up as page not found. Is there any way to set, somewhere in
> > Nutch's config, that if certain words exist in the title or on
> > the page (e.g. ERROR 404, PAGE NOT FOUND or ERROR), Nutch will
> > not index the page?
> 
> You could implement an HTML filter plugin that looks for
> these strings in the title and, when they're found,
> throws a ParseException to abort the page.
> 
> Doug
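
A rough sketch of the title check Doug describes, in plain Java with no Nutch dependencies. The class name Soft404Check, its methods, and PageRejectedException are all hypothetical names made up for illustration. The exception is only a stand-in for the ParseException mentioned above, and hooking the check into Nutch's HTML filter plugin mechanism is not shown, since that depends on the Nutch version in use.

```java
import java.util.Locale;

class Soft404Check {
    // Phrases from the original question. Note that the bare word "error"
    // is very broad and will also match legitimate titles such as
    // "Error handling in Java"; trim this list to suit your crawl.
    private static final String[] ERROR_MARKERS = {
        "error 404", "page not found", "error"
    };

    /** Hypothetical stand-in for the ParseException a real plugin would throw. */
    static class PageRejectedException extends Exception {
        PageRejectedException(String message) {
            super(message);
        }
    }

    /** Returns true if the title contains any marker, ignoring case. */
    static boolean looksLikeErrorPage(String title) {
        if (title == null) {
            return false;
        }
        String normalized = title.toLowerCase(Locale.ROOT);
        for (String marker : ERROR_MARKERS) {
            if (normalized.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    /** Throws if the title looks like a custom 404 page, so the caller can skip it. */
    static void rejectIfErrorPage(String title) throws PageRejectedException {
        if (looksLikeErrorPage(title)) {
            throw new PageRejectedException(
                "Title looks like a custom 404 page: " + title);
        }
    }
}
```

Inside a real filter plugin, you would extract the page title from the parsed document, call something like rejectIfErrorPage(title), and let the exception propagate so the page is not indexed.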
