On 18 Apr 2013, at 16:04, Kingsley Idehen <[email protected]>
wrote:
> On 4/18/13 9:23 AM, Andrea Splendiani wrote:
>> Hi,
>>
>> I think that some caching with a minimum of query rewriting would get rid
>> of 90% of the SELECT ?s ?p ?o WHERE {?s ?p ?o} queries.
> Sorta.
> Client queries are inherently unpredictable. That's always been the case, and
> that predates SPARQL. These issues also exist in the SQL RDBMS realm, which
> is why you don't have SQL endpoints delivering what SPARQL endpoints provide.
I know, but I suspect that these days a lot of these "intensive" queries are
exploratory, issued just to check what is in the dataset, and they may end up
being very similar in structure.
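If that hunch is right, even a naive normalize-and-hash cache in front of the endpoint would catch the repeats. A minimal sketch in Python; the regexes are illustrative stand-ins for a real SPARQL parser, which you'd want in practice:

```python
import hashlib
import re

def cache_key(query: str) -> str:
    """Build a cache key for a SPARQL query by collapsing whitespace and
    lowercasing keywords, so cosmetically different copies of the same
    exploratory query hit the same cache entry.
    (Toy normalization: a production version needs a proper parser.)"""
    normalized = re.sub(r"\s+", " ", query.strip())
    # Lowercase SPARQL keywords but leave variable names and IRIs alone.
    normalized = re.sub(
        r"\b(SELECT|WHERE|DISTINCT|LIMIT|OFFSET|FILTER|OPTIONAL)\b",
        lambda m: m.group(0).lower(),
        normalized,
        flags=re.IGNORECASE,
    )
    return hashlib.sha256(normalized.encode()).hexdigest()

# Two cosmetically different copies of the classic exploratory query
# map to the same key, so the second can be served from cache.
a = cache_key("SELECT ?s ?p ?o WHERE { ?s ?p ?o }")
b = cache_key("select ?s ?p ?o\nwhere   { ?s ?p ?o }")
assert a == b
```

This only catches textually-near-identical queries, which is exactly the "everyone runs the same explore-the-dataset query" case; structurally equivalent but differently-written queries would need algebra-level canonicalization.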
Jerven: can you report on your experience with this? How many of the
problematic queries are not really targeted, but more generic?
>> From a user perspective, I would rather have a clear result code upfront
>> telling me "your query is too heavy", "not enough resources", and so on,
>> than partial results plus extra codes.
> Yes, and you get that in some solutions e.g., what we provide. Basically, our
> server (subject to capacity) will tell you immediately that your query
> exceeds the query cost limits (this is different from timeout limits). The
> aforementioned feature was critical to getting the DBpedia SPARQL endpoint
> going, years ago.
Can you make a precise estimate of the query cost, or do you rely on
heuristics?
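For illustration, even a crude pre-execution heuristic can separate the "dump everything" queries from bounded ones. A toy scorer in Python (my own strawman, not Virtuoso's actual cost model; the regexes are simplistic on purpose):

```python
import re

def heuristic_cost(query: str) -> int:
    """Rough pre-execution cost score for a SPARQL query.
    Strawman heuristic: fully unconstrained triple patterns and a
    missing LIMIT both push the score up."""
    # Crude extraction of the WHERE body; a real estimator would
    # work on the parsed query algebra, not regexes.
    m = re.search(r"\{(.*)\}", query, re.DOTALL)
    body = m.group(1) if m else query
    # Triple patterns where subject, predicate AND object are variables.
    unconstrained = len(re.findall(r"\?\w+\s+\?\w+\s+\?\w+", body))
    has_limit = bool(re.search(r"\bLIMIT\s+\d+", query, re.IGNORECASE))
    cost = 10 * max(1, unconstrained)
    if not has_limit:
        cost *= 5  # unbounded result sets are the expensive case
    return cost
```

A server could reject anything over a configured threshold with an immediate error code, before spending any IO, which is exactly the "clear result code upfront" behaviour asked for above.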
>> I won't make much use of partial results anyway... so it's time wasted on
>> both sides.
> Not in a world where you have a public endpoint and zero control over the
> queries issued by clients.
> Not in a world where you have to provide faceted navigation over entity
> relations as part of a "precision find" style service atop RDF-based Linked
> Data etc..
I mean, partial results are OK if I have control over which part I get... a
system-dependent random subset of the results is not very useful (not even
for statistics!)
>> One empiric solution could be to assign a quota per requesting IP (or other
>> form of identification).
> That's but one coarse-grained factor. You need to be able to associate a user
> agent (human or machine) profile with whatever quality of service you seek
> to scope to said profile. Again, this is the kind of thing we offer by
> leveraging WebID, Inference, and RDF right inside the core DBMS engine.
I agree. The finer the better. The IP-based approach is perhaps relatively
easy to implement if the system provides little built-in support.
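A per-IP token bucket is about the simplest version of such a quota. A sketch in Python (capacities and rates are illustrative, not a recommendation):

```python
import time
from collections import defaultdict

class IpQuota:
    """Per-IP token bucket: each address gets `capacity` query credits
    that refill at `rate` credits per second. Coarse-grained, as noted
    above, but cheap to bolt on in front of any endpoint."""

    def __init__(self, capacity: float = 10.0, rate: float = 0.5):
        self.capacity = capacity
        self.rate = rate
        # Each IP starts with a full bucket, stamped at first sight.
        self.buckets = defaultdict(lambda: (capacity, time.monotonic()))

    def allow(self, ip: str, cost: float = 1.0) -> bool:
        tokens, last = self.buckets[ip]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= cost:
            self.buckets[ip] = (tokens - cost, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False

quota = IpQuota(capacity=2, rate=0.0)  # two free queries, no refill
assert quota.allow("192.0.2.1")
assert quota.allow("192.0.2.1")
assert not quota.allow("192.0.2.1")  # third query is rejected
assert quota.allow("192.0.2.2")      # other clients are unaffected
```

Passing a per-query cost estimate as `cost` (instead of a flat 1.0) would let the same bucket charge heavy queries more than light ones.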
>
>> Then one could restrict the total amount of resource per time-frame,
>> possibly with smart policies.
> "Smart Policies" are the kind of thing you produce by exploiting the kind or
> entity relationship semantics baked into RDF based Linked Data. Basically,
> OWL (which is all about describing entity types and relation types semantics)
> serves this purpose very well. We certainly put it to use in our data access
> policy system which enables us to offer different capabilities and resource
> consumption to different human- or machine-agent profiles.
How do you use OWL for this ?
>
>> It would also avoid people breaking big queries in many small ones...
> You can't avoid bad or challenging queries. What you can do is look to
> fine-grained data access policies (semantically enhanced ACLs) to address
> this problem. This has always been the challenge, even before the emergence
> of the whole Semantic Web, RDF, etc. The same challenges also dogged the
> RDBMS realm. There is no dancing around this matter when dealing with
> traditional RDBMS- or Web-oriented data access.
>>
>> But I was wondering: why is resource consumption a problem for SPARQL
>> endpoint providers, and not for other "providers" on the web (say,
>> YouTube, Google, ...)?
>> Is it the unpredictability of the resources needed?
>
> Good question!
>
> They hide the problem behind airport sized data centers, and then they get
> you to foot the bill via your profile data which ultimately compromises your
> privacy.
Isn't the same possible with SPARQL, in principle?
Although, I guess if a company knew that you were spying on their queries...
there would be some issues (unlike for users and Facebook, for some reason).
best,
Andrea
>
> This is a problem, and it's ultimately the basis for showcasing what RDF
> (an entity relationship based data model endowed with *explicit* rather than
> *implicit* human- and machine-readable entity relationship semantics) is
> actually all about.
>
>
> Kingsley
>>
>> best,
>> Andrea
>>
>> On 18 Apr 2013, at 12:53, Jerven Bolleman
>> <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> Managing a public SPARQL endpoint has some difficulties in comparison to
>>> managing a simpler REST API.
>>> Instead of counting API calls or external bandwidth use, we need to look
>>> at internal IO and CPU usage as well.
>>>
>>> Many of the current public SPARQL endpoints limit all their users to
>>> queries of limited CPU time.
>>> But this is not enough to really manage (mis)use of an endpoint. The
>>> SPARQL API, being HTTP based, also suffers from the problem that we first
>>> send the status code and may only find out later that we can't answer the
>>> query after all, leading to a "200 not OK" problem :(
>>>
>>> What approaches can we come up with as a community to embed
>>> resource-limit-exceeded exceptions in the SPARQL protocols? E.g. we could
>>> add an exception element to the SPARQL XML result format. [1]
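To make that concrete, here is one way such a document could look, built with Python's stdlib XML module. The `<exception>` element is purely a strawman extension, not part of the SPARQL 1.1 Query Results XML Format; its name and `code` attribute are made up for illustration:

```python
import xml.etree.ElementTree as ET

SRX = "http://www.w3.org/2005/sparql-results#"
ET.register_namespace("", SRX)

def partial_result_doc(subject_uris, reason):
    """Sketch of a SPARQL XML results document carrying the bindings
    computed so far, plus a hypothetical non-standard <exception>
    element explaining why the result set is partial."""
    root = ET.Element(f"{{{SRX}}}sparql")
    head = ET.SubElement(root, f"{{{SRX}}}head")
    ET.SubElement(head, f"{{{SRX}}}variable", name="s")
    results = ET.SubElement(root, f"{{{SRX}}}results")
    for value in subject_uris:
        result = ET.SubElement(results, f"{{{SRX}}}result")
        binding = ET.SubElement(result, f"{{{SRX}}}binding", name="s")
        uri = ET.SubElement(binding, f"{{{SRX}}}uri")
        uri.text = value
    # The extension point: NOT in any SPARQL spec, purely a strawman.
    exc = ET.SubElement(root, f"{{{SRX}}}exception")
    exc.set("code", "resource-limit-exceeded")
    exc.text = reason
    return ET.tostring(root, encoding="unicode")

doc = partial_result_doc(["http://example.org/a"], "IO quota exceeded")
```

Because the element trails the `<results>` block, a streaming server can emit bindings as they are produced and still append the exception when the quota trips mid-query, which is exactly the "200 not OK" situation described above.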
>>>
>>> The current limits on CPU use are not enough to really avoid misuse,
>>> which is why I submitted a patch to Sesame that allows limits on memory
>>> use as well, although limits on disk seeks or other IO counts may be
>>> needed by some as well.
>>>
>>> But these are currently hard limits; what I really want is
>>> "playground limits", i.e. you can use the swing as much as you want if
>>> you are the only child in the park.
>>> Once there are more children, you have to share.
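One simple reading of "playground limits" is to divide a fixed resource budget among whoever is active right now, with a guaranteed minimum slice. A sketch (all numbers illustrative, not from any real endpoint):

```python
def fair_share_timeout(active_clients: int,
                       total_budget: float = 120.0,
                       floor: float = 5.0) -> float:
    """Divide a fixed per-interval CPU budget (seconds) among the
    currently active clients, never dropping below a minimal
    guaranteed slice. The lone child gets the whole swing."""
    active = max(1, active_clients)
    return max(floor, total_budget / active)

assert fair_share_timeout(1) == 120.0   # alone in the park: full budget
assert fair_share_timeout(4) == 30.0    # more children: share
assert fair_share_timeout(100) == 5.0   # but nobody is starved entirely
```

A real implementation would recompute this as clients arrive and leave, and would likely weight shares by an agent profile (as discussed earlier in the thread) rather than treating all clients equally.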
>>>
>>> And how do we communicate this to our users? I.e. "this result set is
>>> incomplete because you exceeded your IO quota; please break up your
>>> queries into smaller blocks."
>>>
>>> For my day job, where I manage a 7.4 billion triple store with public
>>> access, some extra tools for managing users would be
>>> great.
>>>
>>> Last but not least, how can we avoid users needing to run SELECT
>>> (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
>>> For beta.sparql.uniprot.org I have been moving much of this information
>>> into the SPARQL endpoint service description, but it's not a place
>>> where people look for this information.
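Another place to advertise those numbers is a VoID description, which already defines properties for exactly these counts. A sketch that emits one (the figures are placeholders, not UniProt's real statistics):

```python
def void_statistics(dataset_uri: str, triples: int,
                    distinct_subjects: int, distinct_objects: int) -> str:
    """Emit a minimal VoID description advertising dataset statistics,
    so clients can read the counts instead of running
    SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} themselves.
    void:triples, void:distinctSubjects and void:distinctObjects are
    standard VoID vocabulary terms."""
    return f"""@prefix void: <http://rdfs.org/ns/void#> .

<{dataset_uri}> a void:Dataset ;
    void:triples {triples} ;
    void:distinctSubjects {distinct_subjects} ;
    void:distinctObjects {distinct_objects} .
"""

# Placeholder figures for a 7.4 billion triple dataset.
doc = void_statistics("http://example.org/void#ds",
                      7_400_000_000, 500_000_000, 900_000_000)
```

The endpoint can regenerate this cheaply from internal statistics after each load, and a well-known document is arguably easier for users to discover than the service description.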
>>>
>>> Regards,
>>> Jerven
>>>
>>> [1] Yeah, these ideas are not great timing just after SPARQL 1.1, but we
>>> can always start SPARQL 1.2 ;)
>>>
>>>
>>>
>>> -------------------------------------------------------------------
>>> Jerven Bolleman [email protected]
>>> SIB Swiss Institute of Bioinformatics Tel: +41 (0)22 379 58 85
>>> CMU, rue Michel Servet 1 Fax: +41 (0)22 379 58 58
>>> 1211 Geneve 4,
>>> Switzerland www.isb-sib.ch - www.uniprot.org
>>> Follow us at https://twitter.com/#!/uniprot
>>> -------------------------------------------------------------------
> --
>
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Company Web: http://www.openlinksw.com
> Personal Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca handle: @kidehen
> Google+ Profile: https://plus.google.com/112399767740508618350/about
> LinkedIn Profile: http://www.linkedin.com/in/kidehen