On 6/21/11 9:41 AM, Henry Story wrote:
A solution to stupid crawlers would be to put the linked data behind https
endpoints and use WebID for authentication. You could still allow everyone
access, but at least you would force the crawler to identify itself, and the
WebIDs could be used to learn who is running each crawler. This could then
feed into evaluating the quality of a semantic web stack.
+1000
Been holding my tongue on that one!!
Kingsley
Henry
A 10-minute intro to WebID: http://bblfish.net/blog/2011/05/25/ (demonstrated
in browsers, but a browser is not strictly necessary)
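
A minimal sketch of the verification step this suggestion implies, assuming a
Python server-side check with the cryptography and rdflib libraries (the
verify_webid helper and the library choices are illustrative, not anything
prescribed in this thread): the server reads the WebID URI from the client
certificate's subjectAlternativeName, dereferences the profile document, and
accepts only if the profile publishes the certificate's key.

from cryptography import x509
from rdflib import Graph, Namespace, URIRef

CERT = Namespace("http://www.w3.org/ns/auth/cert#")

def verify_webid(pem_cert_bytes):
    """Return the verified WebID URI, or None if the check fails."""
    cert = x509.load_pem_x509_certificate(pem_cert_bytes)

    # 1. The claimed WebID sits in the subjectAlternativeName extension.
    san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
    uris = san.value.get_values_for_type(x509.UniformResourceIdentifier)
    if not uris:
        return None
    webid = uris[0]

    # 2. Dereference the WebID profile document: this is the
    #    self-identification step, the URI tells you who runs the client.
    profile = Graph()
    profile.parse(webid)

    # 3. Accept only if the profile lists the same public key as the
    #    certificate (this sketch assumes an RSA client key).
    pub = cert.public_key().public_numbers()
    for key in profile.objects(URIRef(webid), CERT.key):
        modulus = profile.value(key, CERT.modulus)    # xsd:hexBinary
        exponent = profile.value(key, CERT.exponent)  # xsd:integer
        if modulus is None or exponent is None:
            continue
        if int(str(modulus), 16) == pub.n and int(str(exponent)) == pub.e:
            return webid
    return None

Whether access is then granted, logged, or rate-limited is a policy decision;
the point is that every request now carries a dereferenceable identity you can
contact.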
On 21 Jun 2011, at 09:49, Martin Hepp wrote:
Hi all:
For the third time in a few weeks, we have had massive complaints from site
owners that Semantic Web crawlers from universities visited their sites in a
manner close to a denial-of-service attack, i.e., crawling data at maximum
bandwidth with many parallel requests.
It's clear that a single, stupidly written crawler script, run from a powerful
university network, can quickly create a terrible traffic load.
Many of the scripts we saw
- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information therein,
- used no mechanism at all to limit crawling speed or to enforce re-crawl
delays.
This irresponsible behavior could be the last straw that makes site owners say
farewell to academic/W3C-sponsored semantic technology.
So please, please advise all of your colleagues and students NOT to write simple
crawler scripts for the Billion Triples Challenge or anything else without
familiarizing themselves with the state of the art in "friendly crawling".
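
A minimal sketch of what such friendly crawling can look like, using only the
Python standard library (the User-Agent string, contact address, and default
delay below are illustrative assumptions, not prescribed values):

import time
import urllib.robotparser
import urllib.request
from urllib.parse import urlparse, urljoin

# Identify the crawler and give site owners someone to contact.
USER_AGENT = "ExampleUniCrawler/0.1 (+mailto:crawler-admin@example.edu)"
DEFAULT_DELAY = 5.0  # seconds between requests per host if robots.txt is silent

def polite_fetch(urls):
    robots = {}    # robots.txt parser per host
    last_hit = {}  # time of the last request per host

    for url in urls:
        parts = urlparse(url)
        host = parts.scheme + "://" + parts.netloc

        # 1. Fetch and obey robots.txt once per host.
        if host not in robots:
            rp = urllib.robotparser.RobotFileParser(urljoin(host, "/robots.txt"))
            rp.read()
            robots[host] = rp
        rp = robots[host]
        if not rp.can_fetch(USER_AGENT, url):
            continue  # disallowed: skip instead of crawling anyway

        # 2. Honour Crawl-delay, or fall back to a conservative default.
        delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
        wait = last_hit.get(host, 0) + delay - time.time()
        if wait > 0:
            time.sleep(wait)

        # 3. Identify the crawler in the HTTP request header.
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=30) as resp:
            yield url, resp.read()
        last_hit[host] = time.time()

Even this much - an identifiable User-Agent with a contact address, respect for
Disallow and Crawl-delay, and a conservative default request rate - addresses
every item on the list above.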
Best wishes
Martin Hepp
Social Web Architect
http://bblfish.net/
--
Regards,
Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen