On 6/21/11 10:44 AM, Martin Hepp wrote:
Hi Christopher, Henry, all:
The main problem is, imho:
1. the basic attitude of Semantic Web research that the work done in the past or in
other communities is an irrelevant historical relic (databases, middleware, EDI) and
that the old fellows were simply too stupid to understand the "power of semantics
that will make machines understand our data with ease", just by adding a bit of OWL
2 DL axioms, properly dereferencing data entity URIs according to their nice data
publishing guidelines that turn toy examples into a magic art;
2. this attitude being implanted into the heads of eager young people who excelled in the "AI for freshmen",
"complexity theory", and "theorem proving" exams and who now apply the gained
self-confidence from a small subset of life to a broader range of fields, and
3. allocating a lot of money (EU funding) and an abundance of IT resources
(university servers, bandwidth,...) to those folks.
This mindset is the petri dish for stupid crawlers as described.
It's a jungle out there on the InterWeb. You can only survive via self
protection. The whole crawling issue is quite complex. WebID is one way
of controlling matters. Basically, WebID is an API Key++. As a community
(by this I mean Linked Data specifically) we should put WebID to max
use. Make those Agents identify themselves and use ACLs for QoS etc.
We can't have it both ways, the InterWeb is quite primitive, still, in
many respects. Got to self protect.
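
A rough sketch of the "ACLs for QoS" idea, offered only as an illustration: it assumes the server has already verified the caller's WebID (that verification is not shown here), and it maps each known WebID to a requests-per-minute budget with a token bucket. The names `AgentThrottle`, `ACL`, `ANONYMOUS_LIMIT`, and the example WebID URI are all hypothetical.

```python
import time

# Hypothetical ACL: verified WebID URI -> allowed requests per minute.
ACL = {"https://example.org/people/crawler#me": 600}
# Hypothetical budget shared by every agent that is not listed in the ACL.
ANONYMOUS_LIMIT = 30


class AgentThrottle:
    """Token bucket per identified agent; unlisted agents share one bucket."""

    def __init__(self, acl, anonymous_limit, clock=time.monotonic):
        self.acl = acl
        self.anonymous_limit = anonymous_limit
        self.clock = clock
        self.buckets = {}  # bucket key -> (tokens left, time of last refill)

    def allow(self, agent_id=None):
        """Return True if this request fits the agent's budget."""
        known = agent_id in self.acl
        limit = self.acl[agent_id] if known else self.anonymous_limit
        key = agent_id if known else None  # None is the shared anonymous bucket
        now = self.clock()
        tokens, last = self.buckets.get(key, (float(limit), now))
        # Refill in proportion to elapsed time, capped at the per-minute limit.
        tokens = min(float(limit), tokens + (now - last) * limit / 60.0)
        if tokens < 1.0:
            self.buckets[key] = (tokens, now)
            return False
        self.buckets[key] = (tokens - 1.0, now)
        return True
```

A server would call `allow(webid)` once per request and answer with an HTTP 429 or 503 when it returns False. Sharing one bucket among all unidentified agents is a deliberate simplification; a real deployment would more likely key that bucket by IP address.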
Unfortunately, authentication techniques won't help protect typical
site-owners from the dangerous creatures written by Semantic Web researchers
gathering data for the evaluation of their ISWC 2011 submission, because the
site-owners at www.godaddy.com know nothing about WebID at this point ;-)
Yes, so they get fenced off. No matter how you look at this, self
protection is important. As you can imagine, we have been out in this jungle
for a while and encountered many species of critter via DBpedia and LOD
Cloud Cache. If you don't self protect, some of these critters will suck
the living daylights out of your "point of presence on the Web", literally!
Kingsley
Martin
PS: I will not release the IP ranges from which the trouble originated, but
rest assured, there were top research institutions among them.
On Jun 21, 2011, at 10:48 AM, Christopher Gutteridge wrote:
Would some kind of caching crawler mitigate this issue? Have someone write a
well behaved crawler which allowed you to download a recent .ttl.tgz of various
sites. Of course, that assumes the student is able to find such a cache.
Asking people nicely will only work in a very small community.
Henry Story wrote:
A solution to stupid crawlers would be to put the linked data behind https
endpoints, and use WebID for authentication. You could still allow everyone
access, but at least you would force the crawler to identify itself, and use
these WebIDs to learn who was making the crawler. This could then be used as
a piece of the evaluation of the quality of a semantic web stack.
Henry
10 minute intro to WebID
http://bblfish.net/blog/2011/05/25/
(in browsers, but the browser is not really necessary)
On 21 Jun 2011, at 09:49, Martin Hepp wrote:
Hi all:
For the third time in a few weeks, we had massive complaints from site-owners
that Semantic Web crawlers from Universities visited their sites in a way close
to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a
parallelized approach.
It's clear that a single, stupidly written crawler script, run from a powerful
University network, can quickly create terrible traffic load.
Many of the scripts we saw
- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information therein,
- used no mechanisms at all for limiting the default crawling speed and
re-crawling delays.
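
For illustration only, here is a minimal sketch of a crawler loop that avoids the four problems listed above, using Python's standard `urllib.robotparser`. The bot name, contact address, and default delay are hypothetical placeholders, and the helper names (`load_rules`, `delay_for`, `polite_fetch`) are invented for this example.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identity; use a real bot name and a reachable contact address.
USER_AGENT = "ExampleResearchBot/0.1 (+mailto:crawler-admin@example.org)"
DEFAULT_DELAY = 5.0  # assumed pause in seconds when robots.txt sets no Crawl-delay


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def delay_for(rules):
    """Return the site's Crawl-delay for this agent, or a conservative default."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_DELAY


def http_get(url):
    """Fetch one URL, identifying the crawler in the User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def polite_fetch(urls, rules, fetch=http_get, sleep=time.sleep):
    """Fetch allowed URLs one at a time, pausing between requests."""
    delay = delay_for(rules)
    results = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # robots.txt disallows this path, so skip it
        if results:
            sleep(delay)  # sequential requests only, never parallel
        results[url] = fetch(url)
    return results
```

In real use the robots.txt text would first be downloaded from the target site (for example with `http_get`), and re-crawls of the same site would be spaced out by hours or days rather than seconds.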
This irresponsible behavior can be the final reason for site-owners to say
farewell to academic/W3C-sponsored semantic technology.
So please, please - advise all of your colleagues and students to NOT write simple
crawler scripts for the billion triples challenge or whatever else without familiarizing
themselves with the state of the art in "friendly crawling".
Best wishes
Martin Hepp
Social Web Architect
http://bblfish.net/
--
Christopher Gutteridge --
http://id.ecs.soton.ac.uk/person/1248
You should read the ECS Web Team blog:
http://blogs.ecs.soton.ac.uk/webteam/
--
Regards,
Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen