Hi guys,

I just noticed something in Nutch 0.6 when extracting URLs from pages that link to the parent directory. For example, the page is http://www.xyz.com/corporate/aboutus.asp and the link on that page is <a href="../more.asp">Example</a>.

I get tons of URLs being referenced in the form http://www.xyz.com/corporate/../more.asp. In theory this should be a 404 Not Found error, but I've found one or two sites with a custom 404 page that Nutch doesn't pick up as "page not found". Is there any way to set somewhere in Nutch's config that a page should not be indexed if certain words appear in its title or body (e.g. ERROR 404, PAGE NOT FOUND or ERROR)?
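To make the two parts of the problem concrete, here is a rough sketch in plain Java. It is not Nutch code and does not use any Nutch API: the class name LinkCheck, its two methods and the phrase list are all made up for illustration. The first method shows how the standard java.net.URI class resolves "../more.asp" against the page URL, and the second shows the kind of title/body check being asked about.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
final class LinkCheck {

    // Example phrases taken from the question. A bare "error" is left out
    // here because it would also match many legitimate pages.
    static final String[] ERROR_PHRASES = { "error 404", "page not found" };

    // Resolve a link against the page it was found on and collapse
    // any "." or ".." path segments.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).normalize().toString();
    }

    // Return true if the title or body text contains one of the phrases,
    // ignoring case. Null inputs are treated as empty strings.
    static boolean looksLikeNotFound(String title, String bodyText) {
        String t = (title == null) ? "" : title;
        String b = (bodyText == null) ? "" : bodyText;
        String haystack = (t + " " + b).toLowerCase(Locale.ROOT);
        for (String phrase : ERROR_PHRASES) {
            if (haystack.contains(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

With the example above, LinkCheck.resolve("http://www.xyz.com/corporate/aboutus.asp", "../more.asp") gives http://www.xyz.com/more.asp, so resolving the link this way avoids the "/corporate/../" form altogether. A phrase check like looksLikeNotFound is only a heuristic and would need tuning per site.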
I hope you understand the problem? Thanks!
