Hi all:

For the third time in a few weeks, we have received a large number of complaints
from site owners that Semantic Web crawlers run from universities visited their
sites in a way close to a denial-of-service attack, i.e., crawling data at
maximum bandwidth with many parallel requests.

It's clear that a single, stupidly written crawler script, run from a powerful
university network, can quickly create a terrible traffic load.

Many of the scripts we saw

- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked 
contact information therein, 
- used no mechanism at all to limit the default crawling speed or to enforce
re-crawling delays.
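
To make the four points above concrete, here is a minimal sketch in Python of
what a better-behaved script could look like. It is only an illustration, not
a reference implementation: the class name PoliteFetcher, the bot name, the
contact address and the 5-second default delay are all made-up assumptions.
It uses only the standard library (urllib.robotparser for robots.txt and
Crawl-delay) and fetches strictly one URL at a time.

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

# Hypothetical identity: name the crawler and say how to reach its operator.
USER_AGENT = "ExampleResearchBot/0.1 (+mailto:crawler-admin@example.org)"
# Assumed fallback pause (seconds) when robots.txt gives no Crawl-delay.
DEFAULT_DELAY = 5.0


class PoliteFetcher:
    """Sequential fetcher: obeys robots.txt and waits between hits per host."""

    def __init__(self, user_agent=USER_AGENT, default_delay=DEFAULT_DELAY):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self._robots = {}    # host -> RobotFileParser
        self._last_hit = {}  # host -> time.monotonic() of the last request

    def load_robots(self, host, robots_txt):
        """Register robots.txt text for a host without a network call."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def _parser_for(self, url):
        """Return the cached robots.txt parser, downloading it on first use."""
        parts = urlsplit(url)
        if parts.netloc not in self._robots:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            parser.read()  # network call
            self._robots[parts.netloc] = parser
        return self._robots[parts.netloc]

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self._parser_for(url).can_fetch(self.user_agent, url)

    def delay_for(self, host):
        """Crawl-delay from robots.txt if present, else the default delay."""
        parser = self._robots.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        return delay if delay is not None else self.default_delay

    def seconds_to_wait(self, host, now):
        """How long to sleep before the next request to this host."""
        last = self._last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, last + self.delay_for(host) - now)

    def fetch(self, url):
        """Fetch one URL; return its bytes, or None if robots.txt forbids it."""
        host = urlsplit(url).netloc
        if not self.allowed(url):
            return None
        time.sleep(self.seconds_to_wait(host, time.monotonic()))
        request = urllib.request.Request(
            url, headers={"User-Agent": self.user_agent}
        )
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read()
        finally:
            # Record the hit even on failure so that retries are delayed too.
            self._last_hit[host] = time.monotonic()
```

Calling PoliteFetcher().fetch(url) in a plain loop over a URL list would then
pause between requests to the same host. The sketch does not schedule
re-crawls; a real crawler would also need to remember when each URL was last
fetched and leave a sensible interval before visiting it again.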

This irresponsible behavior can be the final reason for site owners to say
farewell to academic/W3C-sponsored semantic technology.

So please, please advise all of your colleagues and students to NOT write
simple crawler scripts for the Billion Triples Challenge or anything else
without first familiarizing themselves with the state of the art in
"friendly crawling".

Best wishes

Martin Hepp
 
