Re: Think before you write Semantic Web crawlers

Kingsley Idehen Tue, 21 Jun 2011 04:40:30 -0700

On 6/21/11 12:06 PM, Henry Story wrote:

On 21 Jun 2011, at 12:23, Kingsley Idehen wrote:

On 6/21/11 10:54 AM, Henry Story wrote:

Then you could just redirect him straight to the n3 dump of graphs of your site 
(I say graphs because your site not necessarily being consistent, the crawler 
may be interested in keeping information about which pages said what)
Redirect may be a bit harsh. So you could at first link him to the dump

Only trouble with the above, is that many don't produce graph dumps anymore, 
they just have SPARQL endpoints, then you pound the endpoints and hit timeouts 
etc..

I would say it is even more important to place SPARQL endpoints behind WebID 
authentication. If you don't do that you
are open to horrendous queries being asked that would be better solved by 
downloading the dump.

Yes, but even WebID alone doesn't protect against inadvertent DOS. This is why the SPARQL engine needs to have server side capabilities that control:


1. Result Set Size
2. Query fulfilment timeouts
3. Granular Query Cost Optimizer.
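
To make the three controls above concrete, here is a minimal Python sketch. It is not how Virtuoso or any other engine implements them; QueryLimits, QueryRejected and run_with_limits are hypothetical names, and the estimated cost is assumed to be handed over by whatever query optimizer sits in front of this code.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class QueryLimits:
    """Hypothetical per-request limits; names and units are assumptions."""
    max_rows: int              # 1. result set size cap
    max_seconds: float         # 2. query fulfilment timeout
    max_estimated_cost: float  # 3. ceiling on the optimizer's cost estimate


class QueryRejected(Exception):
    """Raised when a query is refused before any rows are produced."""


def run_with_limits(
    rows: Iterable,
    estimated_cost: float,
    limits: QueryLimits,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List, bool]:
    """Consume `rows` under `limits`; return (results, partial_flag)."""
    # Refuse up front if the (assumed) cost estimate is above the ceiling.
    if estimated_cost > limits.max_estimated_cost:
        raise QueryRejected("estimated cost %s exceeds limit" % estimated_cost)

    deadline = clock() + limits.max_seconds
    results: List = []
    partial = False
    for row in rows:
        # Stop early on either the row cap or the time budget,
        # and tell the caller that the result set is incomplete.
        if len(results) >= limits.max_rows or clock() > deadline:
            partial = True
            break
        results.append(row)
    return results, partial
```

Returning a partial flag rather than an error on the row cap or timeout is one possible design choice; it lets the endpoint hand back what it has and point the agent at a dump for the rest.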

  Also there is no way of distinguishing good customers from bad ones, and so 
you end up serving everyone badly.

Solved when you can make fine grained QoS based on the features above. This is really the start point for serious SPARQL endpoints. It's been in Virtuoso forever, otherwise DBpedia wouldn't have been possible. Ditto LOD cloud cache, ditto Sindice's endpoint, and lots of other heavy duty endpoints.

The closest thing to SPARQL endpoints on the web are search engine 
query interfaces. But they purposefully limited the queries they had to answer 
to simple + - logic.  And for engines like AltaVista which was owned by Digital 
Equipment Corporation (DEC) a hardware manufacturer, the point was to show off 
the power of their 64 bit chips and hardware in 1995. The more load those 
servers could take the stronger their marketing for their hardware could then 
be.
So their business model was to sell hardware. Unless you want everyone to 
deploy huge numbers of machines for every sparql endpoint - and support the 
construction of a large number of nuclear power stations to feed that need - 
you need to control access more carefully at the source. The best policy is 
to allow all access, but keep an eye open for abuse. Here also a WebID pointing 
to an e-mail address or pingback endpoint could be very useful.

:spider a web:Crawler;
    foaf:mbox <mailto:[email protected]>;
    doap:project <http://gitub.org/rdf-crawler/> .

Information like that could be very useful of course.


It can be more granular than that.

Example rules:

1. Henry (verified via WebID carried by his HTTP User Agent) can execute queries with higher fulfillment costs than "Joe SemWeb Project Researcher" (who doesn't have a WebID)
2. Queries from a given domain, for a User Agent with a WebID can execute queries that require N milliseconds for fulfillment plan construction
3. Ditto but for actual time
4. Ditto but for partial results
5. etc...
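
A rough Python sketch of how rules like these could map an agent to a query budget. Everything here is an assumption made for illustration: the Budget fields, the tier values, the TRUSTED_DOMAINS set and the budget_for function are hypothetical, and WebID verification itself is assumed to have happened earlier (e.g. during the TLS handshake) and is only passed in as a flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    """Hypothetical per-agent query budget."""
    max_plan_ms: int  # time allowed for fulfilment plan construction
    max_exec_ms: int  # actual execution time allowed
    max_rows: int     # size at which results are cut off as partial


# Illustrative tiers only; the numbers are made up.
ANONYMOUS = Budget(max_plan_ms=100, max_exec_ms=1_000, max_rows=1_000)
VERIFIED = Budget(max_plan_ms=500, max_exec_ms=10_000, max_rows=10_000)
TRUSTED = Budget(max_plan_ms=2_000, max_exec_ms=60_000, max_rows=100_000)

# Hypothetical list of domains given the most generous tier.
TRUSTED_DOMAINS = {"example.org"}


def budget_for(webid: Optional[str], verified: bool, client_domain: str) -> Budget:
    """Pick a budget from the agent's WebID status and originating domain."""
    # No WebID, or one that failed verification: lowest tier (rule 1).
    if not webid or not verified:
        return ANONYMOUS
    # Verified WebID coming from a given domain: highest tier (rules 2-4).
    if client_domain.lower() in TRUSTED_DOMAINS:
        return TRUSTED
    # Verified WebID from anywhere else.
    return VERIFIED
```

For instance, budget_for("http://bblfish.net/people/henry/card#me", True, "bblfish.net") would land in the VERIFIED tier under these assumptions, while a request with no WebID gets ANONYMOUS regardless of domain.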

A looong time ago, very early LOD days, we (LOD community) talked about the 
importance of dumps with the heuristic you describe in mind (no WebID then, but 
it was clear something would emerge). Unfortunately, SPARQL endpoints have 
become the first point of call re. Linked Data even though SPARQL endpoint only 
== asking for trouble if you can't self protect the endpoint and re-route agents 
to dumps.

yes,  a sparql in an unwise hand can lead to serious explosions.


Yes, and in the InterWeb jungle you have to assume everyone is unwise :-)

Maybe we can use WebID and recent troubles as basis for reestablishing this 
most vital of best practices re. Linked Data publication. Of course, this is 
also awesome dog-fooding too!

The WebID community (nee foaf+ssl) is really keen to help I am sure. We have 
libs in all languages ready to go. WebID is especially easy to implement for 
server to server communication btw.


Yep!

Links:

1. http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtAuthPolicyFOAFSSL -- Virtuoso and WebID protection of SPARQL endpoints
2. http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtOAuthSPARQL -- Virtuoso and OAuth based protection of SPARQL endpoints



Kingsley

Henry

--

Regards,

Kingsley Idehen 
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Social Web Architect
http://bblfish.net/



