On 21 Jun 2011, at 12:23, Kingsley Idehen wrote:

> On 6/21/11 10:54 AM, Henry Story wrote:
>> 
>> Then you could just redirect him straight to the N3 dump of the graphs of your 
>> site (I say graphs because, your site not necessarily being consistent, the 
>> crawler may be interested in keeping track of which pages said what).
>> A redirect may be a bit harsh, so you could at first just link him to the dump.
> 
> Only trouble with the above is that many don't produce graph dumps anymore; 
> they just have SPARQL endpoints, so you pound the endpoints and hit 
> timeouts etc.

I would say it is even more important to place SPARQL endpoints behind WebID 
authentication. If you don't, you are open to horrendous queries that would be 
better served by downloading the dump. You also have no way of distinguishing 
good customers from bad ones, and so you end up serving everyone badly.
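To make the point concrete, here is a minimal sketch of what distinguishing clients buys you, assuming the server has already extracted a WebID from a TLS client certificate. The `EndpointGate` class, the `QUOTAS` numbers, and the window length are all hypothetical illustrations, not part of any WebID spec:

```python
import time

# Hypothetical per-client quotas: an authenticated WebID gets a larger
# query budget than an anonymous IP address. Numbers are made up for
# illustration; a real deployment would persist state and tune these.
QUOTAS = {"webid": 100, "anonymous": 5}  # queries per accounting window


class EndpointGate:
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.counters = {}  # client key -> (window start, query count)

    def allow(self, client_key, kind):
        """Return True if this client may run another query right now."""
        now = time.monotonic()
        start, count = self.counters.get(client_key, (now, 0))
        if now - start > self.window:
            start, count = now, 0  # new accounting window
        if count >= QUOTAS[kind]:
            return False
        self.counters[client_key] = (start, count + 1)
        return True


gate = EndpointGate()
# An authenticated crawler identified by its WebID:
print(gate.allow("http://example.org/spider#me", "webid"))   # True
# An anonymous client identified only by IP, after using up its quota:
for _ in range(5):
    gate.allow("203.0.113.7", "anonymous")
print(gate.allow("203.0.113.7", "anonymous"))                # False
```

Without an identity stronger than an IP address, every client has to be held to the anonymous budget; with WebID you can serve known, contactable crawlers generously and throttle the rest.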

The closest thing to SPARQL endpoints on the web are search engine query 
interfaces. But search engines purposefully limited the queries they would 
answer to simple +/- logic. And for engines like AltaVista, which was owned by 
the hardware manufacturer Digital Equipment Corporation (DEC), the point in 
1995 was to show off the power of their 64-bit chips: the more load those 
servers could take, the stronger the marketing for their hardware.
So their business model was to sell hardware. Unless you want everyone to 
deploy huge numbers of machines for every SPARQL endpoint - and support the 
construction of a large number of nuclear power stations to feed that need - 
you need to control access more carefully at the source. The best policy is to 
allow all access, but keep an eye open for abuse. Here too a WebID pointing to 
an e-mail address or a pingback endpoint could be very useful.

:spider a web:Crawler;
   foaf:mbox <mailto:[email protected]>;
   doap:project <http://github.org/rdf-crawler/> .

Information like that could be very useful of course.
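For instance, when a server spots abusive traffic it could dereference the crawler's WebID and pull out the contact mailbox from a description like the one above. A real server would parse the profile with a proper RDF library (e.g. rdflib); the stdlib-only extraction below is a deliberately naive sketch just to show the idea:

```python
import re

# The Turtle description a crawler might publish at its WebID (as above).
PROFILE = """\
:spider a web:Crawler;
   foaf:mbox <mailto:[email protected]>;
   doap:project <http://github.org/rdf-crawler/> .
"""


def contact_address(turtle):
    """Naive extraction of a foaf:mbox value from a Turtle snippet.
    Illustration only -- a real server would use an RDF parser
    (e.g. rdflib) rather than a regular expression."""
    m = re.search(r'foaf:mbox\s+<mailto:([^>]+)>', turtle)
    return m.group(1) if m else None


print(contact_address(PROFILE))  # [email protected]
```

With that address in hand, the operator can mail the crawler's owner (or ping its pingback endpoint) instead of silently blocking it.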
> 
> A looong time ago, very early LOD days, we (LOD community) talked about the 
> importance of dumps with the heuristic you describe in mind (no WebID then, 
> but it was clear something would emerge). Unfortunately, SPARQL endpoints 
> have become the first point of call re. Linked Data, even though "SPARQL 
> endpoint only" == asking for trouble unless you can protect the endpoint and 
> re-route agents to dumps.

Yes, SPARQL in unwise hands can lead to serious explosions.

> Maybe we can use WebID and recent troubles as basis for reestablishing this 
> most vital of best practices re. Linked Data publication. Of course, this is 
> also awesome dog-fooding too!

The WebID community (née foaf+ssl) is really keen to help, I am sure. We have 
libs in all languages ready to go. WebID is especially easy to implement for 
server-to-server communication btw.

Henry

> 
> -- 
> 
> Regards,
> 
> Kingsley Idehen       
> President & CEO
> OpenLink Software
> Web: http://www.openlinksw.com
> Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca: kidehen

Social Web Architect
http://bblfish.net/

