On 6/22/11 11:26 PM, Henry Story wrote:
On 23 Jun 2011, at 00:11, Alexandre Passant wrote:
On 22 Jun 2011, at 22:49, Richard Cyganiak wrote:
On 21 Jun 2011, at 10:44, Martin Hepp wrote:
PS: I will not release the IP ranges from which the trouble originated, but
rest assured, there were top research institutions among them.
The right answer is: name and shame. That is the way to teach them.
You may have found the right word: teach.
We (as academics) have given tutorials on how to publish and consume LOD, and lots of
material on best practices for publishing, but not much about consuming.
Why not simply come up with reasonable guidelines for this? They could also be
taught in institutes and universities where people use LOD, and in tutorials
given at various conferences.
That is of course a good idea. But longer term you don't want to teach that
way. It's too time-consuming. You need the machines to do the teaching.
Think about Facebook. How did 500 million people come to use it? Because they
were introduced by friends and learned by using it, not by doing tutorials or
taking courses. The system itself teaches people how to use it.
So the same way, if you want to teach people linked data, get the social web
going and they will learn the rest by themselves. If you want to teach crawlers
to behave, make bad behaviour uninteresting. Create a game with rules where good
behaviour is rewarded and bad behaviour has the opposite effect.
This is why I think using WebID can help. You can use the information to build
lists and rankings of good and bad crawlers: people with good crawlers get to
present papers at crawling conferences, while bad crawlers get throttled out.
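The idea of ranking crawlers by observed behaviour could be sketched roughly as below. This is only an illustration, not anything the thread actually implemented: the class and method names (`CrawlerRegistry`, `record_fetch`), the politeness threshold, and the use of a WebID URI as the key are all assumptions made for the example.

```python
# Hypothetical sketch: track crawlers by their WebID URI and rank them
# by how often they request pages faster than a polite delay allows.
from collections import defaultdict

POLITE_DELAY = 1.0  # assumed minimum seconds between requests to count as polite


class CrawlerRegistry:
    def __init__(self):
        self.last_seen = {}                  # webid -> timestamp of last request
        self.violations = defaultdict(int)   # webid -> count of too-fast requests

    def record_fetch(self, webid, now):
        """Record a request; return True if the crawler behaved politely."""
        last = self.last_seen.get(webid)
        self.last_seen[webid] = now
        if last is not None and now - last < POLITE_DELAY:
            self.violations[webid] += 1
            return False
        return True

    def ranking(self):
        """All known crawlers, best-behaved first (fewest violations)."""
        return sorted(self.last_seen, key=lambda w: self.violations[w])
```

A server could consult such a ranking to decide which identified crawlers to serve at full speed and which to throttle.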
Make it so that the system can grow beyond academic and teaching settings, into
the world of billions of users spread across the world, living in different
political institutions and speaking different languages. We have had good
crawling practices since the beginning of the web, but you need to make them
evident and self-teaching.
E.g. a crawler that crawls too much will get slowed down and redirected to pages
on crawling behaviour, written and translated into every single language on the
planet.
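That "slow down and redirect" mechanism could look something like the following minimal sketch, assuming a per-client token-bucket rate limiter on the server side; the etiquette URL and the `handle_request` helper are invented for illustration.

```python
# Minimal sketch of throttling an over-eager crawler and redirecting it
# to a page explaining good crawling behaviour. The URL is a placeholder.
import time

ETIQUETTE_URL = "https://example.org/crawling-etiquette"  # assumed, not real


class TokenBucket:
    """Classic token bucket: allows bursts up to `capacity`, refills at `rate`/s."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_request(bucket):
    """Serve normally, or redirect an over-eager crawler to the rules page."""
    if bucket.allow():
        return 200, None
    return 302, ETIQUETTE_URL  # HTTP redirect to the behaviour page
```

In a real deployment one bucket would be kept per client IP (or per WebID), and the redirect target could be content-negotiated into the client's language, as Henry suggests.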
+1000
That's the game in a nutshell!
We have to keep virtuous cycles at the core of the increasingly social Web.
Kingsley
Henry
m2c
Alex.
Like Karl said, we should collect information about abusive crawlers so that
site operators can defend themselves. It won't be *that* hard to research and
collect the IP ranges of offending universities.
I started a list here:
http://www.w3.org/wiki/Bad_Crawlers
The list is currently empty. I hope it stays that way.
Thank you all,
Richard
--
Dr. Alexandre Passant,
Social Software Unit Leader
Digital Enterprise Research Institute,
National University of Ireland, Galway
Social Web Architect
http://bblfish.net/
--
Regards,
Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen