Hi Martin,

Have you tried putting Squid [1] as a reverse proxy in front of your servers 
and using delay pools [2] to throttle hungry crawlers?
Maybe that helps.
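
The relevant squid.conf bits could look roughly like this (an untested sketch; 
the User-Agent patterns and the rate are placeholders you would tune, and it 
goes on top of the usual reverse-proxy/accelerator setup):

  # identify likely crawlers by User-Agent (case-insensitive regexes)
  acl crawlers browser -i crawler spider bot
  # one delay pool, class 1 = a single aggregate bucket
  delay_pools 1
  delay_class 1 1
  delay_access 1 allow crawlers
  delay_access 1 deny all
  # restore/max in bytes: cap everything matching 'crawlers' at ~64 KB/s
  delay_parameters 1 64000/64000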

Cheers,
Daniel

[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools

On 21.06.2011, at 09:49, Martin Hepp wrote:

> Hi all:
> 
> For the third time in a few weeks, we had massive complaints from site-owners 
> that Semantic Web crawlers from Universities visited their sites in a way 
> close to a denial-of-service attack, i.e., crawling data with maximum 
> bandwidth in a parallelized approach.
> 
> It's clear that a single, stupidly written crawler script, run from a 
> powerful University network, can quickly create a terrible traffic load. 
> 
> Many of the scripts we saw
> 
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked 
> contact information therein, 
> - used no mechanisms at all for limiting their default crawling speed or for 
> honoring re-crawling delays.
> 
> This irresponsible behavior can be the final reason for site-owners to say 
> farewell to academic/W3C-sponsored semantic technology.
> 
> So please, please advise all of your colleagues and students NOT to write 
> simple crawler scripts for the Billion Triples Challenge or the like without 
> familiarizing themselves with the state of the art in "friendly 
> crawling".
> 
> Best wishes
> 
> Martin Hepp
> 
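
On the crawler side, the "friendly crawling" Martin asks for does not need much 
code. A minimal sketch in Python (the URLs, the contact address, and the delay 
are of course just placeholders):

  import time
  import urllib.robotparser
  import urllib.request

  # identify the bot and give site owners a way to reach you
  USER_AGENT = "ExampleResearchBot/0.1 (+mailto:crawler-admin@example.org)"
  DELAY = 5  # seconds between requests; use at least the Crawl-delay from robots.txt

  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("http://example.org/robots.txt")
  rp.read()

  for url in ("http://example.org/a.rdf", "http://example.org/b.rdf"):
      if not rp.can_fetch(USER_AGENT, url):
          continue  # robots.txt disallows this path
      req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
      with urllib.request.urlopen(req) as resp:
          resp.read()
      time.sleep(DELAY)  # sequential and throttled, no parallel hammering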


