Hi Guys

I just noticed something in Nutch 0.6 when it extracts URLs
from pages that link to the parent directory. For example,
the page is http://www.xyz.com/corporate/aboutus.asp and
the link on that page is <a href="../more.asp">Example</a>.
I get tons of URLs being referenced with the "../" left in,
for example http://www.xyz.com/corporate/../more.asp.
In theory this should be a 404 Not Found error, but I've
found one or two sites with a custom 404 page that Nutch
doesn't pick up as "page not found". Is there any way to
set somewhere in Nutch's config that if certain words
appear in the title or on the page (e.g. ERROR 404, PAGE
NOT FOUND or ERROR), Nutch will not index the page?
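
To make the idea concrete, here is a rough sketch in plain Java of the kind of check being asked for. It is not a Nutch API: the class name Soft404Check, the method looksLikeErrorPage and the phrase list are all hypothetical, and where such a check would be hooked into Nutch is left open. The bare word ERROR is left out of the list because it would probably match many legitimate pages.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: flags pages whose title or text
// contains a typical "not found" phrase even though the server answered 200.
class Soft404Check {
    // Phrases taken from the question above; assumed defaults, tune per site.
    private static final List<String> ERROR_PHRASES =
        Arrays.asList("ERROR 404", "PAGE NOT FOUND");

    static boolean looksLikeErrorPage(String title, String bodyText) {
        // Treat missing fields as empty and compare case-insensitively.
        String haystack = ((title == null ? "" : title) + " "
            + (bodyText == null ? "" : bodyText)).toUpperCase(Locale.ROOT);
        for (String phrase : ERROR_PHRASES) {
            if (haystack.contains(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeErrorPage("Error 404", "Sorry"));         // true
        System.out.println(looksLikeErrorPage("About us", "Our company"));    // false
    }
}
```

A plain substring match like this is easy to fool, so it is probably best limited to the handful of sites known to serve custom error pages.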

I hope you understand the problem?
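
For reference, standard relative-URL resolution (RFC 3986) collapses the "../" segment, so the link in the example would normally point at /more.asp in the site root. The sketch below uses only java.net.URI from the JDK; the class name ResolveLink and its resolve helper are made up for illustration, and whether Nutch 0.6 applies an equivalent step is not something this shows.

```java
import java.net.URI;

// Hypothetical helper: resolves an href against the URL of the page it
// appeared on and removes any "." or ".." path segments.
class ResolveLink {
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).normalize().toString();
    }

    public static void main(String[] args) {
        // prints http://www.xyz.com/more.asp
        System.out.println(
            resolve("http://www.xyz.com/corporate/aboutus.asp", "../more.asp"));
    }
}
```

If extracted links were normalized this way before fetching, the "corporate/../more.asp" form would not be requested at all.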

Thanks!