On 21 June 2011, at 03:49, Martin Hepp wrote:

> Many of the scripts we saw
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked contact information therein,
> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.
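[Editor's note: a minimal sketch, not part of the original thread, of the polite behaviour Martin says was missing: parse robots.txt, check permissions before fetching, and honour Crawl-delay. It uses Python's standard urllib.robotparser; the user-agent string and robots.txt content are hypothetical.]

```python
import urllib.robotparser

# Hypothetical robots.txt of the site being crawled.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler identifies itself and gives contact information.
UA = "ExampleBot/1.0 (+http://example.org/bot; mailto:ops@example.org)"

print(rp.can_fetch(UA, "http://example.org/private/x"))  # False: disallowed path
print(rp.can_fetch(UA, "http://example.org/public/x"))   # True
print(rp.crawl_delay(UA))  # 5: sleep at least this many seconds between fetches
```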
Do you have a list of those and of how to identify them, so we can put them in our blocking lists? For example in .htaccess or the Apache config, with rules such as:

    # added for abusive downloads or not respecting robots.txt
    SetEnvIfNoCase User-Agent ".*Technorati.*" bad_bot
    SetEnvIfNoCase User-Agent ".*WikioFeedBot.*" bad_bot
    # [… cut part of my list …]
    Order Allow,Deny
    Deny from 85.88.12.104
    Deny from env=bad_bot
    Allow from all

-- 
Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software
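[Editor's note: SetEnvIfNoCase matches the User-Agent header against an unanchored, case-insensitive regex, so patterns like the ones above can be sanity-checked against logged UA strings with any regex engine before deploying. A sketch, not part of Karl's message; the UA strings are illustrative:]

```python
import re

# Patterns as they would appear in SetEnvIfNoCase directives.
# Note: the original post had ".*Technorati*.", which actually matches
# "Technorat" plus any character; ".*Technorati.*" (or simply "Technorati",
# since SetEnvIf regexes are unanchored) is what was intended.
bad_bot_patterns = [re.compile(p, re.IGNORECASE)
                    for p in (r".*Technorati.*", r".*WikioFeedBot.*")]

def is_bad_bot(user_agent: str) -> bool:
    """Return True if the UA string matches any blocking pattern."""
    return any(p.search(user_agent) for p in bad_bot_patterns)

print(is_bad_bot("Technoratibot/8.1"))                        # True
print(is_bad_bot("wikiofeedbot 1.0"))                         # True (case-insensitive)
print(is_bad_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
```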
