Would anyone be able to help me with this or provide a sample of how to do this kind of thing?
Thanks

On Mon, 18 Apr 2005 10:38:11 -0700 Doug Cutting <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > I get tons of files being referenced that include other
> > pages, for example http://www.xyz.com/corporate/../more.asp
> > - now this should theoretically be a 404 not found error,
> > however I've found 1 or 2 sites that have a custom 404 page
> > which Nutch doesn't pick up as page not found. Is there
> > any way that I can set somewhere in Nutch's config that if
> > certain words in the title or on the page exist (e.g. ERROR
> > 404 or PAGE NOT FOUND or ERROR), Nutch will not index the page?
>
> You could implement an HTML filter plugin that looks for
> these strings in the title and, when they're found,
> throws a ParseException to abort the page.
>
> Doug
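
For anyone after a starting point, here is a rough sketch of just the title check suggested above, written as plain Java so it can be tried outside Nutch. It is not a working plugin: the class name SoftNotFoundDetector, the marker list, and the SoftNotFoundException type are all made up for illustration. In a real HTML filter plugin you would run the same check inside the filter and throw Nutch's own ParseException instead; check the plugin interface of your Nutch version for the exact method signature.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: detects "soft 404" pages by title.
class SoftNotFoundDetector {

    // Assumed marker phrases, taken from the question. A bare "error" is
    // left out on purpose because it would also match legitimate pages.
    private static final String[] MARKERS = { "error 404", "page not found" };

    // Hypothetical stand-in for the ParseException a Nutch plugin would throw.
    static class SoftNotFoundException extends Exception {
        SoftNotFoundException(String message) {
            super(message);
        }
    }

    // Returns true if the title contains one of the marker phrases,
    // ignoring case. A null title is treated as "no match".
    static boolean looksLikeNotFound(String title) {
        if (title == null) {
            return false;
        }
        String lower = title.toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    // Aborts handling of a page whose title looks like a custom 404 page.
    static void rejectIfNotFound(String url, String title) throws SoftNotFoundException {
        if (looksLikeNotFound(title)) {
            throw new SoftNotFoundException("Custom 404 page, skipping: " + url);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.xyz.com/corporate/../more.asp";
        try {
            rejectIfNotFound(url, "ERROR 404 - Page Not Found");
            System.out.println("would index " + url);
        } catch (SoftNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The match is a case-insensitive substring test on the title only. Checking the page body as well is possible, but it is more likely to reject real pages that merely mention an error.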
