Re: Public SPARQL endpoints:managing (mis)-use and communicating limits to users.

Kingsley Idehen Thu, 18 Apr 2013 08:05:36 -0700

On 4/18/13 9:23 AM, Andrea Splendiani wrote:

Hi,


I think that some caching with a minimum of query rewriting would get rid of 
90% of the select ?s ?p ?o where {?s ?p ?o} queries.


Sorta.

Client queries are inherently unpredictable. That's always been the case, and that predates SPARQL. These issues also exist in the SQL RDBMS realm, which is why you don't have SQL endpoints delivering what SPARQL endpoints provide.
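
To make the caching suggestion concrete, here is a minimal sketch in Python. It assumes a hypothetical `run_query` callable standing in for the real query engine, and the "rewriting" is deliberately naive: it only collapses whitespace, so it does not understand string literals, keyword case, or renamed variables. That limitation is one reason a cache like this only catches the most repetitive queries.

```python
from typing import Callable, Dict


def normalize(query: str) -> str:
    """Collapse runs of whitespace so trivially different spellings share a key.

    Naive by design: whitespace inside string literals is collapsed too,
    so a real implementation would need a proper SPARQL tokenizer.
    """
    return " ".join(query.split())


class QueryCache:
    """Hypothetical result cache keyed on the normalized query text."""

    def __init__(self, run_query: Callable[[str], str]) -> None:
        self._run_query = run_query  # stand-in for the real query engine
        self._results: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def execute(self, query: str) -> str:
        key = normalize(query)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        result = self._run_query(key)
        self._results[key] = result
        return result
```

Two requests that differ only in spacing or line breaks would share one cache entry here; anything beyond that still reaches the engine.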


 From a user perspective, I would rather have a clear result code upfront 
telling me: your query is too heavy, not enough resources and so on, than 
partial results + extra codes.

Yes, and you get that in some solutions e.g., what we provide. Basically, our server (subject to capacity) will tell you immediately that your query exceeds the query cost limits (this is different from timeout limits). The aforementioned feature was critical to getting the DBpedia SPARQL endpoint going, years ago.
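
As an illustration of refusing a query before any results are sent, here is a minimal Python sketch. The cost estimate, the limit, the `Admission` type, and the choice of HTTP status code are all assumptions made for illustration; this is not a description of how any particular server implements its cost limits.

```python
from dataclasses import dataclass


@dataclass
class Admission:
    status: int   # HTTP status to send before any result bytes
    message: str  # human-readable explanation for the client


def admit(estimated_cost: float, cost_limit: float) -> Admission:
    """Decide, before execution starts, whether a query may run.

    `estimated_cost` is assumed to come from the query optimizer and
    `cost_limit` from server configuration; both are hypothetical inputs.
    """
    if estimated_cost > cost_limit:
        # Rejecting here avoids sending a success status and failing later.
        return Admission(
            status=503,  # assumed choice of status code
            message=(
                f"Estimated query cost {estimated_cost:g} exceeds "
                f"the limit of {cost_limit:g}"
            ),
        )
    return Admission(status=200, message="OK")
```

The point of the design is ordering: the decision is made from an estimate, so the client sees the refusal in the status line rather than in a truncated result.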

I wouldn't do much with partial results anyway... so it's time wasted on both sides.

Not in a world where you have a public endpoint and zero control over the queries issued by clients. Not in a world where you have to provide faceted navigation over entity relations as part of a "precision find" style service atop RDF based Linked Data, etc.


One empirical solution could be to assign a quota per requesting IP (or other 
form of identification).

That's but one coarse-grained factor. You need to be able to associate a user agent (human or machine) profile with whatever quality of service you seek to scope to said profile. Again, this is the kind of thing we offer by leveraging WebID, Inference, and RDF right inside the core DBMS engine.

  Then one could restrict the total amount of resource per time-frame, possibly 
with smart policies.

"Smart Policies" are the kind of thing you produce by exploiting the kind of entity relationship semantics baked into RDF based Linked Data. Basically, OWL (which is all about describing the semantics of entity types and relation types) serves this purpose very well. We certainly put it to use in our data access policy system, which enables us to offer different capabilities and resource consumption to different human- or machine-agent profiles.

It would also avoid people breaking big queries in many small ones...

You can't avoid bad or challenging queries. What you can do is look to fine-grained data access policies (semantically enhanced ACLs) to address this problem. This has always been the challenge, even before the emergence of the whole Semantic Web, RDF, etc. The same challenges also dogged the RDBMS realm. There is no dancing around this matter when dealing with traditional RDBMS or Web oriented data access.
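
The per-identifier quota and per-profile budgets discussed above can be sketched roughly as follows. This is a minimal Python illustration, not the policy system described in this thread: the `WindowQuota` class, the profile names, the budgets, and the fixed-window accounting are all assumptions, and the identifier could equally be an IP address or a WebID.

```python
import time
from typing import Callable, Dict, Tuple


class WindowQuota:
    """Fixed-window resource quota per client identifier.

    Profile names and budgets are hypothetical; in practice they would
    be derived from whatever access policy the endpoint applies.
    """

    def __init__(
        self,
        budgets: Dict[str, float],
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._budgets = budgets
        self._window = window_seconds
        self._clock = clock
        # client id -> (window start time, resource units used in window)
        self._usage: Dict[str, Tuple[float, float]] = {}

    def charge(self, client_id: str, profile: str, cost: float) -> bool:
        """Record `cost` units for the client; return False if over budget."""
        now = self._clock()
        start, used = self._usage.get(client_id, (now, 0.0))
        if now - start >= self._window:
            start, used = now, 0.0  # a new window begins
        if used + cost > self._budgets[profile]:
            self._usage[client_id] = (start, used)
            return False
        self._usage[client_id] = (start, used + cost)
        return True
```

For example, `WindowQuota({"anonymous": 10.0, "trusted": 100.0}, 60.0)` would give an anonymous client ten resource units per minute and a trusted one a hundred. Note that a quota on total resource units, rather than on request count, also covers the case of one big query being split into many small ones.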


But I was wondering: why is resource consumption a problem for SPARQL endpoint providers, 
and not for other "providers" on the web (say, YouTube, Google, ...)?
Is it the unpredictability of the resources needed?


Good question!

They hide the problem behind airport sized data centers, and then they get you to foot the bill via your profile data which ultimately compromises your privacy.

This is a problem, and it's ultimately the basis for showcasing what RDF (entity relationship based data model endowed with *explicit* rather than *implicit* human- and machine-readable entity relationship semantics) is actually all about.



Kingsley


best,
Andrea

On 18 Apr 2013, at 12:53, Jerven Bolleman 
<[email protected]> wrote:

Hi All,

Managing a public SPARQL endpoint has some difficulties in comparison to 
managing a simpler REST api.
Instead of counting api calls or external bandwidth use we need to look at 
internal IO and CPU usage as well.

Many of the current public SPARQL endpoints limit all their users to queries of 
limited CPU time.
But this is not enough to really manage (mis)use of an endpoint. Also, the 
SPARQL API, being HTTP based, suffers from the problem that we first send the 
status code and may only find out later that we can't answer the query after 
all, leading to a "200 not OK" problem :(

What approaches can we come up with as a community to embed "resource limit 
exceeded" exceptions in the SPARQL protocols? E.g. we could add an exception 
element to the SPARQL XML result format.[1]
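
As a purely hypothetical illustration of this idea, the Python sketch below shows how a client might detect such a marker at the end of a result document. The `exception` element and its placement are invented here for illustration; they are not part of the SPARQL Query Results XML Format, and `read_results` is an assumed helper name.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}


def read_results(xml_text: str) -> Tuple[List[dict], Optional[str]]:
    """Return (rows, error) from a SPARQL XML result document.

    `error` is the text of a HYPOTHETICAL <exception> element that a
    server could append when it runs out of resources mid-stream.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        row = {}
        for binding in result.findall("sr:binding", NS):
            value = binding[0]  # the <uri>, <literal> or <bnode> child
            row[binding.get("name")] = value.text
        rows.append(row)
    exc = root.find("sr:exception", NS)  # not in the standard format
    return rows, (exc.text if exc is not None else None)
```

A client using something like this would treat a non-empty `error` as "the rows are valid but incomplete", which is exactly the case a 200 status code alone cannot express.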

The current limits to CPU use are not enough to really avoid misuse, which is 
why I submitted a patch to Sesame that allows limits on memory use as well, 
although limits on disk seeks or other IO counts may be needed by some too.

But these are currently hard limits; what I really want is 
"playground limits", i.e. you can use the swing as much as you want if you are 
the only child in the park.
Once there are more children you have to share.
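
One way to read "playground limits" is as max-min fair sharing of a fixed capacity. The Python sketch below is an illustration under that assumption; the function name, the notion of a single scalar "capacity", and per-user "demands" are all hypothetical simplifications of real CPU, memory, and IO accounting.

```python
from typing import Dict


def fair_shares(capacity: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Max-min fair split of `capacity` among users with given demands.

    A lone user may take everything it asks for (up to capacity); when
    several users compete, nobody gets more than an equal share unless
    others leave part of theirs unused.
    """
    shares = {user: 0.0 for user in demands}
    remaining = capacity
    active = {u: d for u, d in demands.items() if d > 0}
    while active and remaining > 1e-12:
        equal = remaining / len(active)
        # Users whose demand fits within the equal share are fully served.
        satisfied = {u: d for u, d in active.items() if d <= equal}
        if not satisfied:
            # Everyone left wants more than an equal share: split evenly.
            for u in active:
                shares[u] += equal
            break
        for u, d in satisfied.items():
            shares[u] += d
            remaining -= d
            del active[u]
    return shares
```

With a capacity of 100, a single user asking for 80 gets all 80; if two users ask for 80 each and a third asks for 10, the small request is served in full and the other two get 45 each.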

And how do we communicate this to our users? E.g. "this result set is 
incomplete because you exceeded your IO quota; please break up your queries 
into smaller blocks."

For my day job, where I manage a 7.4 billion triple store with public access, 
some extra tools for managing users would be great.

Last but not least, how can we spare users from having to run SELECT 
(COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
For beta.sparql.uniprot.org I have been moving much of this information into 
the SPARQL endpoint description, but it's not a place
where people look for this information.

Regards,
Jerven

[1] Yeah these ideas are not great timing just after 1.1 but we can always 
start SPARQL 1.2 ;)



-------------------------------------------------------------------
Jerven Bolleman                        [email protected]
SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.isb-sib.ch - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------



--

Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
