Iain wrote:
I'm testing Nutch (version 0.8) with a view to exhaustive scraping.

But I've got some sites that don't scrape and no idea why.  Case in point is
http://www.idc.com.

This is a HUGE site, but I get nothing in nutch.

Check http://www.idc.com/robots.txt: it specifically disallows all other robots (*) from accessing this site. Nutch obeys robots.txt, so it fetches nothing from there.

(and I agree that we should produce some message in the logs about this ...).
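
A quick way to confirm this kind of block outside of Nutch is to run the robots.txt rules through a parser by hand. The following is only a minimal sketch in Python using the standard urllib.robotparser module. The sample rules are hypothetical (they mimic the "named robot allowed, everyone else disallowed" pattern, not the actual contents of the idc.com file), and the agent name "MyNutchCrawler" and the helper is_allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, NOT the real idc.com file: one named robot
# is allowed everywhere, every other robot (*) is shut out.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "MyNutchCrawler" is an assumed agent name, used only for this sketch.
    for agent in ("Googlebot", "MyNutchCrawler"):
        verdict = is_allowed(SAMPLE_ROBOTS_TXT, agent, "http://www.idc.com/")
        print(agent, "allowed" if verdict else "blocked by robots.txt")
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.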

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

