On 6/22/11 10:42 AM, Martin Hepp wrote:
Just to inform the community that the BTC / research crawlers have succeeded in killing a major RDF source for e-commerce:
OpenEAN - a dataset of >1 million product models and their EAN/UPC codes at
http://openean.kaufkauf.net/id/ - has been permanently shut down by the site
operator, because fighting bad Semantic Web crawlers was taking too much of his time.
Thanks a lot to everybody who contributed to that. It trashes a month of work
and many millions of useful triples.
Martin,
Is there a dump anywhere? Can they at least continue to produce RDF dumps?
We have some of their data (from prior dump loads) in our LOD cloud
cache [1].
Links:
1.
http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fopenean.kaufkauf.net%2Fid%2F&urilookup=1
Kingsley
Best
Martin Hepp
On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:
Hello!
The difference between these two scenarios is that there is almost no CPU
involvement in serving a PDF file, whereas naive RDF sites burn many cycles
generating the response to a request for an RDF document.
Right now queries to data.southampton.ac.uk (e.g.
http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are answered
live, but this is not efficient. My colleague, Dave Challis, has prepared a
SPARQL endpoint that caches results, which we can turn on if the load gets
too high; that should at least mitigate the problem. Very few datasets
change within a 24-hour period.
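A caching layer of that kind can be sketched as below - a hedged illustration only, not Dave Challis's actual implementation; the cache directory, TTL, and the `render_rdf` stand-in are all hypothetical:

```python
import os
import tempfile
import time

CACHE_DIR = os.path.join(tempfile.gettempdir(), "rdf-cache")  # hypothetical
TTL = 24 * 3600  # if few datasets change within 24 hours, cache that long

def render_rdf(resource_id):
    # Stand-in for the expensive step: querying the triple store and
    # serializing the result to RDF/XML.
    return f"<rdf:RDF><!-- data for {resource_id} --></rdf:RDF>"

def serve_rdf(resource_id):
    """Serve a cached copy while fresh; regenerate only when stale."""
    path = os.path.join(CACHE_DIR, resource_id + ".rdf")
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < TTL:
        with open(path) as f:
            return f.read()            # cheap, static-file-style response
    body = render_rdf(resource_id)     # expensive path, taken rarely
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        f.write(body)
    return body
```

With a scheme like this, a crawler's repeated hits mostly land on the cheap branch, much like serving static PDFs.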
Hmm, I would strongly argue that this is not the case (and stale datasets are
a big issue in LOD, imho!). The data on the BBC website, for example,
changes approximately 10 times a second.
We've also been hit in the past (and still are now, to a lesser extent) by
badly behaving crawlers. I agree that, as we don't provide dumps, crawling
is the only way to generate an aggregation of BBC data, but we've had
downtime in the past caused by crawlers. After that happened, there were
lots of discussions about whether we should publish RDF data at
all (thankfully, we succeeded in arguing that we should keep it - but
that's a lot of time spent arguing instead of publishing new juicy RDF
data!)
I also want to point out (in response to Andreas's email) that HTTP
caches are *completely* ineffective at protecting a dataset against that,
as crawlers tend to be exhaustive. ETags and Expires headers are
helpful, but chances are that 1) you don't know when the data will
change - you can just make a wild guess based on previous behavior - and 2)
the cache will have expired by the time the crawler requests a document
a second time, as it has ~100M (in our case) documents to crawl
through.
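To illustrate why ETags help only so much: a strong ETag derived from the content lets a revalidating client skip the transfer, but the server still has to regenerate (or store) the body to compute it, so CPU is not saved. A hedged sketch of the mechanics:

```python
import hashlib

def make_etag(body):
    # Strong ETag derived from the representation itself: it stays valid
    # exactly as long as the data does, so no guessing at expiry times.
    return '"%s"' % hashlib.sha256(body).hexdigest()[:16]

def respond(body, if_none_match=None):
    """Conditional GET: 304 on a matching ETag, full 200 otherwise."""
    etag = make_etag(body)             # note: body was still generated!
    if if_none_match == etag:
        return 304, b""                # revalidation saves bandwidth only
    return 200, body
```

And an exhaustive crawler's first pass over ~100M documents never benefits from revalidation at all.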
Request throttling would work, but you would have to find a way to
identify crawlers, which is tricky: most of them use multiple IPs and
don't set appropriate user agents (the crawlers that currently hit us
the most identify themselves as wget and Java 1.6 :/ ).
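As a sketch of the throttling idea (the rate, burst size, and client key are hypothetical - and, as noted, choosing a reliable key is the hard part):

```python
import time
from collections import defaultdict

RATE = 1.0    # tokens refilled per second
BURST = 5.0   # bucket capacity

_buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

def allow(client_key):
    """Token bucket: admit a request only if the client has a token left.

    client_key might be an IP, a /24 prefix, or a user agent; none of
    these identifies a distributed crawler reliably.
    """
    b = _buckets[client_key]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
    b["ts"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False
```

A burst of rapid requests from one key exhausts its bucket, after which requests are rejected until tokens refill.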
So overall, there is no excuse for badly behaving crawlers!
Cheers,
y
Martin Hepp wrote:
Hi Daniel,
Thanks for the link! I will relay this to relevant site-owners.
However, I still challenge Andreas's statement that the site-owners are to
blame for publishing large amounts of data on small servers.
One can publish 10,000 PDF documents on a tiny server without being hit by
DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
But for sure, it is necessary to advise all publishers of large RDF datasets
to protect themselves against hungry crawlers and actual DoS attacks.
Imagine if a large site was brought down by a botnet that is exploiting
Semantic Sitemap information for DoS attacks, focussing on the large dump
files.
This could end LOD experiments for that site.
Best
Martin
On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
Hi Martin,
Have you tried to put a Squid [1] as reverse proxy in front of your servers
and use delay pools [2] to catch hungry crawlers?
Cheers,
Daniel
[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools
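For reference, the simplest form of delay pool (class 1, a single aggregate bucket) looks roughly like this in squid.conf - a hypothetical fragment to be adapted to the site's own ACLs:

```
# Hypothetical squid.conf fragment: throttle matched clients to ~64 KB/s
# after an initial 256 KB burst, using one aggregate (class 1) pool.
acl throttled src 0.0.0.0/0          # matches everyone; narrow as needed
delay_pools 1
delay_class 1 1
delay_parameters 1 64000/256000      # restore rate / bucket size, in bytes
delay_access 1 allow throttled
```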
On 21.06.2011, at 09:49, Martin Hepp wrote:
Hi all:
For the third time in a few weeks, we had massive complaints from
site-owners that Semantic Web crawlers from Universities visited their sites
in a way close to a denial-of-service attack, i.e., crawling data with
maximum bandwidth in a parallelized approach.
It's clear that a single, stupidly written crawler script, run from a
powerful University network, can quickly create terrible traffic load.
Many of the scripts we saw
- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information therein,
- used no mechanisms at all for limiting the default crawling speed and
re-crawling delays.
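By contrast, honoring robots.txt takes only a few lines with Python's standard library - a sketch, with a hypothetical user agent string and a `fetch` callback supplied by the caller:

```python
import time
import urllib.robotparser

# Hypothetical UA: identifies the crawler and gives contact information.
UA = "example-research-crawler/0.1 (mailto:ops@example.org)"

def make_parser(robots_txt):
    """Build a parser from the text of a site's robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

def polite_fetch(rp, urls, fetch, default_delay=1.0):
    """Crawl sequentially, honoring Disallow rules and Crawl-delay."""
    delay = rp.crawl_delay(UA) or default_delay
    for url in urls:
        if not rp.can_fetch(UA, url):
            continue          # robots.txt says hands off
        fetch(url)            # send UA in the User-Agent header here
        time.sleep(delay)     # one request at a time, spaced out
```

Sequential requests with a delay, a descriptive user agent, and respect for Disallow rules would already avoid every problem listed above.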
This irresponsible behavior can be the final reason for site-owners to say
farewell to academic/W3C-sponsored semantic technology.
So please, please - advise all of your colleagues and students NOT to write
simple crawler scripts for the Billion Triples Challenge or anything else
without familiarizing themselves with the state of the art in "friendly
crawling".
Best wishes
Martin Hepp
--
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
--
Regards,
Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen