Hi Martin,

Have you tried putting Squid [1] as a reverse proxy in front of your servers and using delay pools [2] to throttle hungry crawlers? Maybe that helps.
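Roughly, a minimal squid.conf sketch might look like the following (the origin address, bandwidth figures, and the User-Agent pattern are only placeholders, and delay pools need Squid built with --enable-delay-pools; see the wiki page [2] for the exact semantics of your version):

  # Squid as accelerator (reverse proxy) in front of the origin server
  http_port 80 accel defaultsite=www.example.org
  cache_peer 192.0.2.10 parent 80 0 no-query originserver name=origin

  # Mark requests whose User-Agent header looks like a crawler
  acl crawlers browser -i (crawler|spider|bot)

  # One class-2 delay pool: an aggregate limit plus a per-client-IP limit
  delay_pools 1
  delay_class 1 2
  # restore/max in bytes per second: ~512 kbit/s overall, ~64 kbit/s per client IP
  delay_parameters 1 64000/64000 8000/8000
  delay_access 1 allow crawlers
  delay_access 1 deny all

That way well-behaved visitors are untouched, while anything matching the crawler ACL gets squeezed through the pool.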
Cheers,
Daniel

[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools

On 21.06.2011, at 09:49, Martin Hepp wrote:

> Hi all:
>
> For the third time in a few weeks, we had massive complaints from site-owners
> that Semantic Web crawlers from Universities visited their sites in a way
> close to a denial-of-service attack, i.e., crawling data with maximum
> bandwidth in a parallelized approach.
>
> It's clear that a single, stupidly written crawler script, run from a
> powerful University network, can quickly create terrible traffic load.
>
> Many of the scripts we saw
>
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked
>   contact information therein,
> - used no mechanisms at all for limiting the default crawling speed and
>   re-crawling delays.
>
> This irresponsible behavior can be the final reason for site-owners to say
> farewell to academic/W3C-sponsored semantic technology.
>
> So please, please - advise all of your colleagues and students to NOT write
> simple crawler scripts for the billion triples challenge or whatsoever
> without familiarizing themselves with the state of the art in "friendly
> crawling".
>
> Best wishes
>
> Martin Hepp
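(As an aside, the "crawling speed limitations in robots.txt" Martin mentions are usually expressed via the non-standard but widely honoured Crawl-delay directive; a polite crawler would respect an entry such as the following, where the 10-second delay is of course just an example value:

  User-agent: *
  Crawl-delay: 10

)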
