On 21 June 2011, at 03:49, Martin Hepp wrote:
> Many of the scripts we saw
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked 
> contact information therein, 
> - used no mechanisms at all for limiting the default crawling speed and 
> re-crawling delays.
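
For reference, a speed limitation of the kind mentioned above is usually expressed in robots.txt with the Crawl-delay directive. Note that Crawl-delay is a de-facto extension honored by some crawlers, not part of the original robots.txt specification, and the values below are only illustrative:

User-agent: *
Crawl-delay: 10
Disallow: /private/

A crawler that respects this should wait at least 10 seconds between requests and skip /private/ entirely.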


Do you have a list of those bots and how to identify them, so we can add them to our blocking lists?

.htaccess or Apache config with rules such as:

# added for abusive downloads or not respecting robots.txt
SetEnvIfNoCase User-Agent "Technorati" bad_bot
SetEnvIfNoCase User-Agent "WikioFeedBot" bad_bot
# [… cut part of my list …]
Order Allow,Deny
Deny from 85.88.12.104
Deny from env=bad_bot
Allow from all
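
The Order/Allow/Deny syntax above is for Apache 2.2. On Apache 2.4 those directives only work through mod_access_compat; a rough 2.4 equivalent using mod_authz_core would look like this (a sketch, same IP and env var as above):

<RequireAll>
    Require all granted
    Require not ip 85.88.12.104
    Require not env bad_bot
</RequireAll>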



-- 
Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software

