Peter Ansell wrote:
2008/11/27 Hugh Glaser <[EMAIL PROTECTED]>:
Prompted by the thread on "linked data hosted somewhere", I would like to ask the above question, which has been bothering me for a while.
Is the only reason anyone can afford to offer a SPARQL endpoint that it doesn't get used too much?
As abstract components for studying interaction, performance, etc.: DB=KB, SQL=SPARQL. In fact, I often consider the components themselves interchangeable; that is, the first step of the migration to SW technologies for an application is to take an SQL-based back end, simply replace it with a SPARQL/RDF back end, and then carry on.
However.
No serious DB publisher gives direct SQL access to their DB (I think).
There are often commercial reasons, of course. But even when there are not (the Open in LOD), there are only search options and possibly download facilities.
Even government organisations that have a remit to publish their data don't offer SQL access.
Will we not have to do the same?
Or perhaps there is a subset of SPARQL that I could offer that would allow me to provide a "safer" service, one that conforms to others' safer services (so it is well understood)?
Is this defined, or is anyone working on it?
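To make that concrete, here is a rough, purely illustrative sketch of the kind of gatekeeper I have in mind - something that only forwards a restricted form of query and forces a row limit. The endpoint URL, the row cap and the list of disallowed query forms are made-up choices, not any defined standard:

# Illustrative only: a tiny gatekeeper that forwards a "safer" subset of
# SPARQL to a back-end endpoint. Endpoint URL and limits are hypothetical.
import re
import urllib.parse
import urllib.request

ENDPOINT = "http://example.org/sparql"   # hypothetical back-end endpoint
MAX_ROWS = 1000

def safer_query(query: str) -> str:
    """Reject obviously expensive query forms and force a LIMIT."""
    if re.search(r'\b(DESCRIBE|CONSTRUCT)\b', query, re.I):
        raise ValueError("only SELECT/ASK allowed in this safer subset")
    if not re.search(r'\bLIMIT\s+\d+', query, re.I):
        query += f" LIMIT {MAX_ROWS}"    # cap unbounded result sets
    return query

def run(query: str) -> bytes:
    q = safer_query(query)
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": q})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req, timeout=60) as resp:   # hard timeout
        return resp.read()

# Example: an unbounded query gets "LIMIT 1000" appended before forwarding.
# run("SELECT DISTINCT ?Concept WHERE { [] a ?Concept }")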
And I am not referring to any particular software - it seems to me that this is something that LODers need to worry about.
We aim to take over the world; and if SPARQL endpoints are part of that (maybe they aren't - just resolvable URIs?), then we should make damn sure that we think they can be delivered.
My answer to my subject question?
No, not as it stands. And we need to have a story to replace it.
Best
Hugh
I don't think we can afford to offer the actual public-grade infrastructure for free unless there is corporate backing for particular endpoints. However, we can still tentatively roll out SPARQL endpoints and resolvers in mirror configurations, together with software that can round-robin across the endpoints to retrieve information without overloading any single endpoint; that gives us at least some redundancy and lets us figure out what needs to be done to fine-tune the methods for distributed queries. Once you have the ability to round-robin across SPARQL endpoints, and still choose among them intelligently based on knowledge of what is inside each one, you can distribute the source RDF to anyone and have them give back the information about how to access their endpoint. If people are found to be overloading an endpoint, you can send them a polite message asking them either to round-robin across the available endpoints or to get their own local SPARQL installation, which can be configured to respond in the same way as the public endpoint.
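As a rough sketch of the round-robin part (the mirror URLs below are placeholders rather than actual Bio2RDF mirrors, and the details are only illustrative):

# Round-robin failover across SPARQL endpoints (illustrative sketch only;
# the mirror URLs are placeholders).
import itertools
import urllib.parse
import urllib.request

MIRRORS = [
    "http://mirror1.example.org/sparql",
    "http://mirror2.example.org/sparql",
    "http://mirror3.example.org/sparql",
]
_cycle = itertools.cycle(MIRRORS)

def query_round_robin(sparql: str) -> bytes:
    """Try each mirror in turn, starting from the next one in the cycle."""
    last_error = None
    for _ in range(len(MIRRORS)):
        endpoint = next(_cycle)              # spread load across mirrors
        url = endpoint + "?" + urllib.parse.urlencode({"query": sparql})
        req = urllib.request.Request(
            url, headers={"Accept": "application/sparql-results+json"})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()           # first mirror that answers wins
        except Exception as err:             # overloaded or down: move on
            last_error = err
    raise RuntimeError(f"all mirrors failed: {last_error}")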
An example implementation of this functionality is the distribution of queries across endpoints for Bio2RDF [1]. Together with the distribution of a combination of Virtuoso DB files [2] and source N-Triples files [3], this makes it relatively simple for people to download the software [4] and the resolver package, and to redirect the configuration file to their own local versions for large-scale private use of the semantics, using exactly the same URIs that resolve via a combination of the publicly available resolvers, which may or may not be contacting public SPARQL endpoints. An example of a public resolver contacting a combination of public and private SPARQL endpoints is [5]. (Please don't go and overload it, though, because as Hugh says, the threat of overloading is quite real for any particular endpoint :) )
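The "choose them intelligently" part mostly comes down to knowing which namespaces sit behind which endpoint. The following is a made-up routing table rather than the actual Bio2RDF configuration format, but it shows the idea of sending bio2rdf.org-style URIs to a local mirror where one exists:

# Hypothetical routing table: which endpoint holds which Bio2RDF namespace.
# Not the real configuration format; URLs and namespaces are examples only.
ROUTES = {
    "geneid":  "http://localhost:8890/sparql",       # your local Virtuoso mirror
    "uniprot": "http://mirror.example.org/sparql",   # someone else's mirror
}
DEFAULT = "http://public.example.org/sparql"         # fallback public endpoint

def endpoint_for(uri: str) -> str:
    """Pick an endpoint from the namespace in a bio2rdf.org-style URI,
    e.g. http://bio2rdf.org/geneid:15275 -> the 'geneid' route."""
    namespace = uri.rsplit("/", 1)[-1].split(":", 1)[0]
    return ROUTES.get(namespace, DEFAULT)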
Peter,
If you configure the Virtuoso INI file appropriately, the deliberate or inadvertent DoS vulnerability is alleviated.
You can append this to your Virtuoso INI (if not there already):
[SPARQL]
ResultSetMaxRows = 1000 ; cap on the number of rows any query can return
DefaultGraph = http://bio2rdf.org
MaxQueryExecutionTime = 60 ; seconds a query may run before it is cut off
MaxQueryCostEstimationTime = 400 ; seconds; reject queries whose estimated cost exceeds this
DefaultQuery = select distinct ?Concept where {[] a ?Concept}
I do agree that arbitrary SPARQL queries should be localised to private installations, but before you do that you have to provide easy ways for people to get private installations that resolve URIs in the same way that they are resolved on the public web.
We have also made this part of the DBpedia on EC2 solution; the URIs are localized while the original data source links are retained via attribution etc.
So <http://<ec2-cname>/resource/Berlin> will be resolved locally while carrying an attribution link (dc:source) to <http://dbpedia.org/resource/Berlin>. The attribution triple doesn't exist in the quad store (so it doesn't result in one triple per resource, which would increase the size unnecessarily); we simply produce it "on the fly" via a re-write rule.
Kingsley
Cheers,
Peter
[1] http://bio2rdf.mquter.qut.edu.au/admin/configuration/rdfxml
[2] http://quebec.bio2rdf.org/download/virtuoso/indexed/
[3] http://quebec.bio2rdf.org/download/n3/
[4] http://sourceforge.net/project/platformdownload.php?group_id=142631
[5] http://bio2rdf.mquter.qut.edu.au/
--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com