On 6/22/11 3:34 PM, Karl Dubost wrote:
On 21 June 2011 at 03:49, Martin Hepp wrote:
Many of the scripts we saw
- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information therein,
- used no mechanisms at all for limiting the default crawling speed and
re-crawling delays.
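For contrast, here is a minimal sketch of what a well-behaved crawler should do, using only Python's standard urllib.robotparser. The robots.txt content, URLs, and the "ExampleBot" user-agent are hypothetical; Crawl-delay is a nonstandard but widely honored extension.

```python
import urllib.robotparser

# Hypothetical robots.txt with a crawl-speed limitation.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite bot checks permission for every URL before fetching it ...
assert rp.can_fetch("ExampleBot", "http://example.org/page.html")
assert not rp.can_fetch("ExampleBot", "http://example.org/private/x")

# ... and honors the advertised delay between successive requests
# (e.g. time.sleep(delay) in the fetch loop).
delay = rp.crawl_delay("ExampleBot") or 1
print(delay)
```

A bot that skips either check is exactly the kind of client the blocking rules below are aimed at.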
Do you have a list of those and how to identify them, so we can put them in our blocking lists?
.htaccess or Apache config with rules such as:
# added for abusive downloads or not respecting robots.txt
SetEnvIfNoCase User-Agent "Technorati" bad_bot
SetEnvIfNoCase User-Agent "WikioFeedBot" bad_bot
# [… cut part of my list …]
Order Allow,Deny
Deny from 85.88.12.104
Deny from env=bad_bot
Allow from all
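For what it's worth, the Order/Allow/Deny directives above are the Apache 2.2 idiom; a sketch of the equivalent in the newer mod_authz_core syntax (Apache 2.4) would look like this — same hypothetical bot names and IP as above:

```apache
# Apache 2.4+ equivalent using Require (mod_authz_core)
SetEnvIfNoCase User-Agent "Technorati" bad_bot
SetEnvIfNoCase User-Agent "WikioFeedBot" bad_bot
<RequireAll>
    Require all granted
    Require not ip 85.88.12.104
    Require not env bad_bot
</RequireAll>
```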
But that doesn't solve the underlying problem. Ultimately, the only way this will scale is an Apache module for WebID that allows QoS algorithms or heuristics based on trust logics. Apache can keep pace via modules; Henry, Joe, and a few others are working on keeping Apache in step with the new Data Space dimension of the Web :-)
--
Regards,
Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen