On 6/22/11 12:54 PM, Hugh Glaser wrote:
Hi Chris.
One way to do the caching really efficiently:
http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html
That is what RKB has always done.
But of course caching does not solve the problem of a single bad crawler.
Or a SPARQL query gone horribly wrong, albeit inadvertently. In short,
this is the most challenging case of all. We even had to protect against
the same thing with SQL access via ODBC, JDBC, ADO.NET, and OLE-DB.
Basically, that's why we still have a business selling drivers when DBMS
vendors offer free variants.
It actually makes things worse: you add a cache write cost to every query,
without a significant probability of a future cache hit, and you increase
disk usage.
Yes, and WebID adds identity fidelity to handling such inevitable challenges.
This is why (IMHO) WebID is the second most important innovation,
after the URI, with respect to Linked Data.
Kingsley
Hugh
----- Reply message -----
From: "Christopher Gutteridge"<[email protected]>
To: "Martin Hepp"<[email protected]>
Cc: "Daniel Herzig"<[email protected]>, "[email protected]"<[email protected]>,
"[email protected]"<[email protected]>
Subject: Think before you write Semantic Web crawlers
Date: Wed, Jun 22, 2011 9:18 am
The difference between these two scenarios is that there's almost no CPU
involvement in serving the PDF file, but naive RDF sites use lots of cycles to
generate the response to a query for an RDF document.
Right now, queries to data.southampton.ac.uk (e.g.
http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are answered
live, but this is not efficient. My colleague Dave Challis has prepared a
SPARQL endpoint that caches results, which we can turn on if the load gets too
high; that should at least mitigate the problem. Very few datasets change in a
24-hour period.
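A minimal sketch of what such a result cache might look like (hypothetical
Python, not Dave's actual implementation; it keys the cache on the query
string and expires entries after 24 hours, matching the observation above):

    import hashlib
    import time

    CACHE_TTL = 24 * 60 * 60  # seconds; most datasets change less than daily
    _cache = {}  # SHA-1 of the query -> (timestamp, result)

    def cached_query(query, run_query):
        """Serve a cached SPARQL result while fresh; recompute only on a miss."""
        key = hashlib.sha1(query.encode("utf-8")).hexdigest()
        entry = _cache.get(key)
        if entry and time.time() - entry[0] < CACHE_TTL:
            return entry[1]
        result = run_query(query)  # only cache misses hit the triple store
        _cache[key] = (time.time(), result)
        return result

Hugh's caveat above still applies, though: a crawler that never repeats a
query pays the write cost without ever producing a hit.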
Martin Hepp wrote:
Hi Daniel,
Thanks for the link! I will relay this to relevant site-owners.
However, I still challenge Andreas' statement that the site-owners are to blame
for publishing large amounts of data on small servers.
One can publish 10,000 PDF documents on a tiny server without being hit by
DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
That said, it is certainly necessary to advise all publishers of large RDF
datasets to protect themselves against hungry crawlers and actual DoS attacks.
Imagine a large site being brought down by a botnet that exploits
Semantic Sitemap information for DoS attacks, focusing on the large dump files.
That could end LOD experiments at such a site.
Best
Martin
On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
Hi Martin,
Have you tried to put a Squid [1] as reverse proxy in front of your servers
and use delay pools [2] to catch hungry crawlers?
Cheers,
Daniel
[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools
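For what it's worth, a minimal squid.conf sketch of that setup (untested;
the host name, ports, and rates are placeholders to adapt):

    # Squid as reverse proxy in front of the origin web server
    http_port 80 accel defaultsite=data.example.org
    cache_peer 127.0.0.1 parent 8080 0 no-query originserver

    # One class-2 delay pool: aggregate traffic unlimited (-1/-1),
    # but each client IP is refilled at 64 KB/s with a 1 MB burst bucket
    delay_pools 1
    delay_class 1 2
    delay_access 1 allow all
    delay_parameters 1 -1/-1 64000/1000000

A hungry crawler then drains its own per-IP bucket and gets throttled,
while ordinary visitors are barely affected.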
On 21.06.2011, at 09:49, Martin Hepp wrote:
Hi all:
For the third time in a few weeks, we have had massive complaints from
site-owners that Semantic Web crawlers from universities visited their sites
in a way close to a denial-of-service attack, i.e., crawling data in parallel
at maximum bandwidth.
It's clear that a single, stupidly written crawler script, run from a powerful
university network, can quickly create a terrible traffic load.
Many of the scripts we saw
- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information therein,
- used no mechanisms at all for limiting their default crawling speed or
honoring re-crawling delays (a minimal polite-crawler sketch follows below).
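By contrast, a minimal polite-crawler sketch (hypothetical Python, standard
library only; the User-Agent string, contact address, and URLs are
placeholders) would look like:

    import time
    import urllib.robotparser
    import urllib.request

    # Identify yourself and give site-owners a way to reach you
    USER_AGENT = "ExampleUniBot/1.0 (mailto:[email protected])"

    rp = urllib.robotparser.RobotFileParser("http://example.org/robots.txt")
    rp.read()
    # Honour an explicit Crawl-delay; otherwise default to a conservative pause
    delay = rp.crawl_delay(USER_AGENT) or 5

    for url in ["http://example.org/data/1.rdf", "http://example.org/data/2.rdf"]:
        if not rp.can_fetch(USER_AGENT, url):
            continue  # robots.txt forbids this path
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as response:
            response.read()
        time.sleep(delay)  # sequential requests, never parallel

Even this naive version already honours robots.txt, respects crawl delays,
identifies itself, and rate-limits itself - the four points above.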
This irresponsible behavior can be the final reason for site-owners to say
farewell to academic/W3C-sponsored semantic technology.
So please, please advise all of your colleagues and students NOT to write
simple crawler scripts for the Billion Triples Challenge or anything else
without familiarizing themselves with the state of the art in "friendly crawling".
Best wishes
Martin Hepp
--
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
--
Regards,
Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen