Thanks, Jiri, but the load comes from academic crawler prototypes run from high-bandwidth university infrastructures.

Best
Martin
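PS: For site-owners who want to try Daniel's Squid suggestion (further down
in the thread), the delay-pool part of squid.conf is only a few lines. A
rough sketch - the ACL and the byte rates are illustrative values, not a
recommendation:

    # squid.conf - throttle clients with a class-2 delay pool
    acl heavy_clients src 0.0.0.0/0     # start with everyone; narrow as needed

    delay_pools 1                       # one pool in total
    delay_class 1 2                     # pool 1 is class 2: aggregate + per-IP buckets
    delay_access 1 allow heavy_clients  # which requests fall into pool 1

    # format: aggregate_rate/aggregate_max per_ip_rate/per_ip_max (bytes, -1 = unlimited)
    delay_parameters 1 -1/-1 32768/65536  # no overall cap, ~32 KB/s per client IP

A class-2 pool meters each client IP separately, so one greedy crawler slows
itself down without starving everybody else.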
On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:

> I wonder, are there ways to link RDF data so that conventional crawlers do
> not crawl it, but only the semantic-web-aware ones do?
> I am not sure how the current practice of linking by link tag in the HTML
> headers could cause this, but it may be the case that those heavy loads
> come from crawlers having nothing to do with the semantic web...
> Maybe we should start linking to our RDF/XML, Turtle, and N-Triples files
> and publishing sitemap info in RDFa...
>
> Best,
> Jiri
>
> On 06/22/2011 09:00 AM, Steve Harris wrote:
>> While I don't agree with Andreas exactly that it's the site owner's
>> fault, this is something that publishers of non-semantic data have to
>> deal with as well.
>>
>> If you publish a large collection of interlinked data which looks
>> interesting to conventional crawlers and is expensive to generate,
>> conventional web crawlers will be all over it. The main difference is
>> that a greater percentage of those are written properly, to follow
>> robots.txt and the guidelines about hit frequency (maximum 1 request per
>> second per domain, no parallel crawling).
>>
>> Has someone published similar guidelines for semantic web crawlers?
>>
>> The ones that don't behave themselves get banned, either in robots.txt
>> or explicitly by the server.
>>
>> - Steve
>>
>> On 2011-06-22, at 06:07, Martin Hepp wrote:
>>
>>> Hi Daniel,
>>> Thanks for the link! I will relay this to the relevant site-owners.
>>>
>>> However, I still challenge Andreas' statement that the site-owners are
>>> to blame for publishing large amounts of data on small servers.
>>>
>>> One can publish 10,000 PDF documents on a tiny server without being hit
>>> by DoS-style crazy crawlers. Why should the same not hold if I publish
>>> RDF?
>>>
>>> But for sure, it is necessary to advise all publishers of large RDF
>>> datasets to protect themselves against hungry crawlers and actual DoS
>>> attacks.
>>>
>>> Imagine if a large site were brought down by a botnet exploiting
>>> Semantic Sitemap information for DoS attacks, focussing on the large
>>> dump files. This could end the LOD experiments for that site.
>>>
>>> Best
>>>
>>> Martin
>>>
>>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>>
>>>> Hi Martin,
>>>>
>>>> Have you tried putting Squid [1] as a reverse proxy in front of your
>>>> servers and using delay pools [2] to catch hungry crawlers?
>>>>
>>>> Cheers,
>>>> Daniel
>>>>
>>>> [1] http://www.squid-cache.org/
>>>> [2] http://wiki.squid-cache.org/Features/DelayPools
>>>>
>>>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>>>>
>>>>> Hi all:
>>>>>
>>>>> For the third time in a few weeks, we had massive complaints from
>>>>> site-owners that Semantic Web crawlers from universities visited
>>>>> their sites in a way close to a denial-of-service attack, i.e.,
>>>>> crawling data at maximum bandwidth in a parallelized approach.
>>>>>
>>>>> It's clear that a single, stupidly written crawler script, run from a
>>>>> powerful university network, can quickly create terrible traffic
>>>>> load.
>>>>>
>>>>> Many of the scripts we saw
>>>>>
>>>>> - ignored robots.txt,
>>>>> - ignored clear crawling-speed limitations in robots.txt,
>>>>> - did not identify themselves properly in the HTTP request header, or
>>>>>   lacked contact information therein, and
>>>>> - used no mechanisms at all for limiting the default crawling speed
>>>>>   and re-crawling delays.
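[For contrast, the politeness Martin is asking for fits in a dozen lines. A
rough sketch only - the site URL, contact address, and fallback delay are
placeholders, not a finished crawler:

    import time
    import urllib.robotparser
    import urllib.request

    BASE = "http://example.org"  # placeholder target site
    UA = "ExampleBot/0.1 (+mailto:crawler-admin@example.org)"  # identify yourself

    # Fetch and parse robots.txt once, before crawling anything.
    rp = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    rp.read()
    delay = rp.crawl_delay(UA) or 1.0  # honor Crawl-delay; default to 1 req/s

    def fetch(url):
        if not rp.can_fetch(UA, url):
            return None  # path is off-limits per robots.txt - skip it
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        time.sleep(delay)  # sequential and rate-limited: no parallel hammering
        return data

Sequential fetching with a sleep between requests addresses every item on
the list above.]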
>>>>>
>>>>> This irresponsible behavior can be the final reason for site-owners
>>>>> to say farewell to academic/W3C-sponsored semantic technology.
>>>>>
>>>>> So please, please - advise all of your colleagues and students NOT to
>>>>> write simple crawler scripts for the Billion Triples Challenge or
>>>>> anything else without familiarizing themselves with the state of the
>>>>> art in "friendly crawling".
>>>>>
>>>>> Best wishes
>>>>>
>>>>> Martin Hepp
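For reference, the robots.txt side of what Steve describes above - a
crawl-rate limit for everyone plus an explicit ban on one misbehaving bot -
looks roughly like this (the bot name is made up):

    User-agent: *
    Crawl-delay: 1

    # hypothetical crawler that would not behave itself
    User-agent: GreedyAcademicBot
    Disallow: /

Crawl-delay is not part of the original robots.txt specification, so a
well-behaved crawler should throttle itself even when the directive is
missing.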
