Iain wrote:
I'm testing Nutch (version 0.8) with a view to exhaustive scraping.
However, some sites don't get scraped and I have no idea why. A case in
point is http://www.idc.com.
This is a HUGE site, but I get nothing from it in Nutch.
Check http://www.idc.com/robots.txt - it specifically disallows all
other robots (*), i.e. any robot it does not list by name, from
accessing this site. Nutch obeys robots.txt, so it fetches nothing.
(And I agree that we should produce some message in the logs about this
...)
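
For anyone who wants to check this sort of thing before starting a crawl, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt text below is a made-up sample of the "allow one named robot, disallow everyone else" pattern, not the actual file served by www.idc.com, and the agent name "NutchTestBot" and the helper is_allowed are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; this is NOT the
# real file from www.idc.com.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "NutchTestBot" is an assumed agent name, not a Nutch default.
    for agent in ("Googlebot", "NutchTestBot"):
        verdict = is_allowed(SAMPLE_ROBOTS_TXT, agent, "http://www.idc.com/")
        print(agent, "allowed" if verdict else "disallowed")
```

To test a live site instead, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.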
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com