Hi guys,

I just noticed something in Nutch 0.6 when it extracts URLs
from pages that link to the parent directory. For example,
the page is http://www.xyz.com/corporate/aboutus.asp and
the link on that page is <a href="../more.asp">Example</a>.
I get tons of URLs being referenced in this form, for
example http://www.xyz.com/corporate/../more.asp.

In theory this should be a 404 Not Found error. However,
I've found one or two sites with a custom 404 page that
Nutch doesn't pick up as "page not found". Is there anything
I can set in Nutch's config so that if certain words appear
in the title or on the page (e.g. ERROR 404, PAGE NOT FOUND
or ERROR), Nutch will not index the page?
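
To make the example concrete, here is a minimal Java sketch of
the two pieces involved: resolving the relative link, and
checking a page title for error phrases. It is not Nutch code
and does not use any Nutch API; the class name, method names and
marker list are made up for illustration, and it relies only on
the standard java.net.URI class. Under the standard resolution
rules the ../ segment collapses, so it may be that the
un-collapsed URL is a normalization issue rather than a truly
missing page.

```java
import java.net.URI;
import java.util.Locale;

// Illustrative sketch only: none of these names come from Nutch.
class LinkCheckSketch {

    // Hypothetical marker phrases taken from the examples in this post.
    // A bare "error" is left out here because it would likely match
    // many legitimate titles.
    static final String[] ERROR_MARKERS = { "error 404", "page not found" };

    // Resolve a link against the page it was found on.
    // URI.resolve also removes "." and ".." segments from the result.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Collapse "." and ".." segments in an already-absolute URL.
    static String normalize(String url) {
        return URI.create(url).normalize().toString();
    }

    // True if the title contains any marker phrase, ignoring case.
    static boolean looksLikeErrorPage(String title) {
        if (title == null) {
            return false;
        }
        String lower = title.toLowerCase(Locale.ROOT);
        for (String marker : ERROR_MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // prints http://www.xyz.com/more.asp
        System.out.println(
            resolve("http://www.xyz.com/corporate/aboutus.asp", "../more.asp"));
        // prints http://www.xyz.com/more.asp
        System.out.println(
            normalize("http://www.xyz.com/corporate/../more.asp"));
        // prints true, then false
        System.out.println(looksLikeErrorPage("ERROR 404 - Page Not Found"));
        System.out.println(looksLikeErrorPage("About Us"));
    }
}
```

A check like looksLikeErrorPage would have to run after the page
is parsed and before it is indexed. Where that hook would live in
Nutch is exactly what I'm unsure about.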

I hope the problem is clear.

Thanks!


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
