Thanks, Jiri, but the load comes from academic crawler prototypes run from high-bandwidth university infrastructures.

Best
Martin
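PS: For site-owners who want to try Daniel's Squid suggestion (further down
in the thread), the delay-pool part of squid.conf is only a few lines. A
rough sketch - the ACL and the byte rates are illustrative values, not a
recommendation:

    # squid.conf - throttle clients with a class-2 delay pool
    acl heavy_clients src 0.0.0.0/0     # start with everyone; narrow as needed

    delay_pools 1                       # one pool in total
    delay_class 1 2                     # pool 1 is class 2: aggregate + per-IP buckets
    delay_access 1 allow heavy_clients  # which requests fall into pool 1

    # format: aggregate_rate/aggregate_max per_ip_rate/per_ip_max (bytes, -1 = unlimited)
    delay_parameters 1 -1/-1 32768/65536  # no overall cap, ~32 KB/s per client IP

A class-2 pool meters each client IP separately, so one greedy crawler slows
itself down without starving everybody else.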
On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:

> I wonder, are there ways to link RDF data so that conventional crawlers do
> not crawl it, but only the semantic-web-aware ones do?
> I am not sure how the current practice of linking by link tag in the HTML
> headers could cause this, but it may be the case that those heavy loads
> come from crawlers having nothing to do with the semantic web...
> Maybe we should start linking to our RDF/XML, Turtle, and N-Triples files
> and publishing sitemap info in RDFa...
>
> Best,
> Jiri
>
> On 06/22/2011 09:00 AM, Steve Harris wrote:
>> While I don't agree with Andreas exactly that it's the site owner's
>> fault, this is something that publishers of non-semantic data have to
>> deal with as well.
>>
>> If you publish a large collection of interlinked data which looks
>> interesting to conventional crawlers and is expensive to generate,
>> conventional web crawlers will be all over it. The main difference is
>> that a greater percentage of those are written properly, to follow
>> robots.txt and the guidelines about hit frequency (maximum 1 request per
>> second per domain, no parallel crawling).
>>
>> Has someone published similar guidelines for semantic web crawlers?
>>
>> The ones that don't behave themselves get banned, either in robots.txt
>> or explicitly by the server.
>>
>> - Steve
>>
>> On 2011-06-22, at 06:07, Martin Hepp wrote:
>>
>>> Hi Daniel,
>>> Thanks for the link! I will relay this to the relevant site-owners.
>>>
>>> However, I still challenge Andreas' statement that the site-owners are
>>> to blame for publishing large amounts of data on small servers.
>>>
>>> One can publish 10,000 PDF documents on a tiny server without being hit
>>> by DoS-style crazy crawlers. Why should the same not hold if I publish
>>> RDF?
>>>
>>> But for sure, it is necessary to advise all publishers of large RDF
>>> datasets to protect themselves against hungry crawlers and actual DoS
>>> attacks.
>>>
>>> Imagine if a large site were brought down by a botnet exploiting
>>> Semantic Sitemap information for DoS attacks, focussing on the large
>>> dump files. This could end the LOD experiments for that site.
>>>
>>> Best
>>>
>>> Martin
>>>
>>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>>
>>>> Hi Martin,
>>>>
>>>> Have you tried putting Squid [1] as a reverse proxy in front of your
>>>> servers and using delay pools [2] to catch hungry crawlers?
>>>>
>>>> Cheers,
>>>> Daniel
>>>>
>>>> [1] http://www.squid-cache.org/
>>>> [2] http://wiki.squid-cache.org/Features/DelayPools
>>>>
>>>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>>>>
>>>>> Hi all:
>>>>>
>>>>> For the third time in a few weeks, we had massive complaints from
>>>>> site-owners that Semantic Web crawlers from universities visited
>>>>> their sites in a way close to a denial-of-service attack, i.e.,
>>>>> crawling data at maximum bandwidth in a parallelized approach.
>>>>>
>>>>> It's clear that a single, stupidly written crawler script, run from a
>>>>> powerful university network, can quickly create terrible traffic
>>>>> load.
>>>>>
>>>>> Many of the scripts we saw
>>>>>
>>>>> - ignored robots.txt,
>>>>> - ignored clear crawling-speed limitations in robots.txt,
>>>>> - did not identify themselves properly in the HTTP request header, or
>>>>>   lacked contact information therein, and
>>>>> - used no mechanisms at all for limiting the default crawling speed
>>>>>   and re-crawling delays.
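[For contrast, the politeness Martin is asking for fits in a dozen lines. A
rough sketch only - the site URL, contact address, and fallback delay are
placeholders, not a finished crawler:

    import time
    import urllib.robotparser
    import urllib.request

    BASE = "http://example.org"  # placeholder target site
    UA = "ExampleBot/0.1 (+mailto:crawler-admin@example.org)"  # identify yourself

    # Fetch and parse robots.txt once, before crawling anything.
    rp = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    rp.read()
    delay = rp.crawl_delay(UA) or 1.0  # honor Crawl-delay; default to 1 req/s

    def fetch(url):
        if not rp.can_fetch(UA, url):
            return None  # path is off-limits per robots.txt - skip it
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        time.sleep(delay)  # sequential and rate-limited: no parallel hammering
        return data

Sequential fetching with a sleep between requests addresses every item on
the list above.]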
>>>>>
>>>>> This irresponsible behavior can be the final reason for site-owners
>>>>> to say farewell to academic/W3C-sponsored semantic technology.
>>>>>
>>>>> So please, please - advise all of your colleagues and students NOT to
>>>>> write simple crawler scripts for the Billion Triples Challenge or
>>>>> anything else without familiarizing themselves with the state of the
>>>>> art in "friendly crawling".
>>>>>
>>>>> Best wishes
>>>>>
>>>>> Martin Hepp
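For reference, the robots.txt side of what Steve describes above - a
crawl-rate limit for everyone plus an explicit ban on one misbehaving bot -
looks roughly like this (the bot name is made up):

    User-agent: *
    Crawl-delay: 1

    # hypothetical crawler that would not behave itself
    User-agent: GreedyAcademicBot
    Disallow: /

Crawl-delay is not part of the original robots.txt specification, so a
well-behaved crawler should throttle itself even when the directive is
missing.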
