On 6/22/11 7:57 PM, Martin Hepp wrote:
Jiri:
The crawlers causing problems were run by Universities, mostly in the context
of ISWC submissions. No need to cast any doubt on that.
All:
As a consequence of those events, I will not publish sitemaps etc. of future
GoodRelations datasets on these lists, but will instead inform non-toy consumers directly.
If you consider yourself a non-toy consumer of e-commerce data, please send me
an e-mail, and we will add you to our ping chain.
We will also stop sending pings to PTSW, Watson, Swoogle, et al., because they
will just expose sites adopting GoodRelations and related technology to
academic crawling.
In the meantime, I recommend the LOD bubble diagram sources for
self-referential research.
Martin,
Linked Data is Linked Data. The Serendipitous Discovery Quotient of every
LINK is inherently high, and gets higher as the mesh gets denser. Once
Linked Data is out there, there is a path to crawl it.
Inevitably, Linked Data access needs ACL control. Luckily, we actually do
have a solution in WebID; a quick sketch of the check follows the list below.
Maybe we can use this problem as another use case, this time addressing:
1. HTML+RDFa
2. Access Control Lists
3. Crawling.
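To make the WebID part concrete: below is a minimal sketch (in Python, using
rdflib) of the verification step, assuming the web server has already completed
the TLS handshake and handed us the claimed WebID URI plus the RSA modulus and
exponent from the client certificate. The function name and arguments are
illustrative; the predicates come from the W3C cert ontology.

    from rdflib import Graph, URIRef, Namespace

    CERT = Namespace("http://www.w3.org/ns/auth/cert#")

    def verify_webid(webid_uri, cert_modulus, cert_exponent):
        """Check that the WebID profile lists the public key
        presented in the client certificate."""
        profile = Graph()
        profile.parse(webid_uri)  # dereference the claimed WebID
        person = URIRef(webid_uri)
        for key in profile.objects(person, CERT.key):
            modulus = profile.value(key, CERT.modulus)
            exponent = profile.value(key, CERT.exponent)
            if modulus is not None and exponent is not None \
                    and int(modulus, 16) == cert_modulus \
                    and int(exponent) == cert_exponent:
                return True  # key matches the profile: authenticated
        return False  # no matching key: deny access

A crawler that cannot authenticate against a whitelisted profile simply gets
refused, which gives publishers ACL control without taking the data off the
open web.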
We have to protect innovations via innovation. If Linked Data is useful,
then it should provide the foundation for solving its own consumption and
publication challenges. These issues are just beginning; there is so much more to come.
Let's just solve the problem. Requesting good behavior will never bring
stability or decorum to a jungle full of critters. We have to make them
feel the robustness of the system :-)
Kingsley
Best
M. Hepp
On Jun 22, 2011, at 4:03 PM, Jiří Procházka wrote:
I understand that, but I doubt your conclusion that those crawlers are
targeting the Semantic Web: as you said, they don't even properly
identify themselves, and as far as I know universities also research regular
web search and crawling. Maybe a lot of them are targeting the
Semantic Web, but we should look at all measures to conserve bandwidth,
from avoiding the interest of regular web crawlers and aiding infrastructure
like Ping the Semantic Web, to optimizing delivery and even distributing
the data among resources.
Best,
Jiri
On 06/22/2011 03:21 PM, Martin Hepp wrote:
Thanks, Jiri, but the load comes from academic crawler prototypes firing from
powerful university infrastructures.
Best
Martin
On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:
I wonder, are there ways to link RDF data so that conventional crawlers do not
crawl it, but only the Semantic Web-aware ones do?
I am not sure how the current practice of linking via the link tag in the
HTML head could cause this, but it may be the case that those heavy loads
come from crawlers that have nothing to do with the Semantic Web...
Maybe we should start linking to our RDF/XML, Turtle, and N-Triples files and
publishing sitemap info in RDFa...
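For instance, the page head could advertise the alternate serializations
explicitly. A sketch with placeholder URLs (the media type used for N-Triples
in particular varies in the wild):

    <link rel="alternate" type="application/rdf+xml" href="/data/catalog.rdf" />
    <link rel="alternate" type="text/turtle" href="/data/catalog.ttl" />
    <link rel="alternate" type="text/plain" href="/data/catalog.nt" />

Whether conventional crawlers follow such links is up to them, of course.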
Best,
Jiri
On 06/22/2011 09:00 AM, Steve Harris wrote:
While I don't exactly agree with Andreas that it's the site owner's fault, this
is something that publishers of non-semantic data have to deal with too.
If you publish a large collection of interlinked data which looks interesting
to conventional crawlers and is expensive to generate, those crawlers
will be all over it. The main difference is that a greater percentage
of them are written properly, following robots.txt and the guidelines about
hit frequency (at most one request per second per domain, no parallel crawling).
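For reference, a robots.txt expressing those guidelines might look like the
snippet below ("/dumps/" is a placeholder path; Crawl-delay is a de facto
extension that some crawlers honour and others ignore):

    # at most one request per second; keep the bulk dumps off-limits
    User-agent: *
    Crawl-delay: 1
    Disallow: /dumps/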
Has someone published similar guidelines for semantic web crawlers?
The ones that don't behave themselves get banned, either in robots.txt, or
explicitly by the server.
- Steve
On 2011-06-22, at 06:07, Martin Hepp wrote:
Hi Daniel,
Thanks for the link! I will relay this to relevant site-owners.
However, I still challenge Andreas' statement that the site-owners are to blame
for publishing large amounts of data on small servers.
One can publish 10,000 PDF documents on a tiny server without being hit by
DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
But for sure, it is necessary to advise all publishers of large RDF datasets to
protect themselves against hungry crawlers and actual DoS attacks.
Imagine if a large site were brought down by a botnet exploiting
Semantic Sitemap information for DoS attacks, focusing on the large dump files.
This could end LOD experiments for that site.
Best
Martin
On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
Hi Martin,
Have you tried to put a Squid [1] as reverse proxy in front of your servers
and use delay pools [2] to catch hungry crawlers?
Cheers,
Daniel
[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools
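For anyone trying this: a minimal squid.conf sketch using a class-2 delay pool
(per-client bandwidth cap) might look roughly like the following. The numbers
are illustrative, not tuned recommendations.

    # aggregate bandwidth unrestricted (-1/-1), but throttle each
    # client to ~64 KB/s once it has used up a 256 KB burst
    acl all_clients src all
    delay_pools 1
    delay_class 1 2
    delay_parameters 1 -1/-1 64000/256000
    delay_access 1 allow all_clients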
On 21.06.2011, at 09:49, Martin Hepp wrote:
Hi all:
For the third time in a few weeks, we had massive complaints from site-owners
that Semantic Web crawlers from Universities visited their sites in a way close
to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a
parallelized approach.
It's clear that a single, stupidly written crawler script, run from a powerful
University network, can quickly create terrible traffic load.
Many of the scripts we saw
- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information therein,
- used no mechanisms at all for limiting the default crawling speed and
re-crawling delays.
This irresponsible behavior can be the final reason for site-owners to say
farewell to academic/W3C-sponsored semantic technology.
So please, please - advise all of your colleagues and students NOT to write simple
crawler scripts for the Billion Triples Challenge or anything else without familiarizing
themselves with the state of the art in "friendly crawling".
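As a starting point, the core of a friendly crawler fits in a few lines of
Python. The user-agent string, contact address, and URLs below are
placeholders, and the third-party requests package is assumed.

    import time
    import urllib.robotparser

    import requests

    # identify the crawler and give site-owners a way to reach us
    USER_AGENT = "ExampleAcademicBot/0.1 (mailto:crawler-admin@example.org)"
    DEFAULT_DELAY = 1.0  # seconds between requests to the same host

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("http://example.org/robots.txt")
    robots.read()

    def polite_fetch(url):
        """Fetch url only if robots.txt allows it, then pause."""
        if not robots.can_fetch(USER_AGENT, url):
            return None  # respect Disallow rules
        response = requests.get(url, headers={"User-Agent": USER_AGENT})
        # honour an explicit Crawl-delay if the site sets one
        time.sleep(robots.crawl_delay(USER_AGENT) or DEFAULT_DELAY)
        return response

One shared rate limiter per host and no parallel requests against the same
site would already avoid most of the incidents described above.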
Best wishes
Martin Hepp
--
Regards,
Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen