Hi, let me get back to this thread for two reasons. 1) I was wondering whether the report on DBpedia queries cited below has already been published. 2) I have recently tried to use DBpedia for some simple computation and I have run into a problem. Basically, a query for all cities whose population is larger than that of the countries they are in returns a random number of results. I suspect this is due to hitting some internal computation load limits, and there is not much I can do about limits, I think, as the results are no more than 20 or so.
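The query has roughly this shape (a sketch; the dbo: class and property names are my best guess and may not be exactly right):

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?city ?cityPop ?country ?countryPop
WHERE {
  ?city a dbo:City ;
        dbo:country ?country ;
        dbo:populationTotal ?cityPop .
  ?country dbo:populationTotal ?countryPop .
  # keep only cities reported as more populous than their own country
  FILTER (?cityPop > ?countryPop)
}

(Comparing the rows actually returned against SELECT (COUNT(*) AS ?n) over the same pattern might reveal truncation, but presumably the count itself runs under the same limits.)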
Now, I discovered this by chance. If it is due to some limit, I would much prefer an error message ("query too expensive") to partial results. Is there a way to detect that these results are partial? Otherwise, a whole range of use cases becomes problematic. I know DBpedia is a best-effort free resource, so I understand the need for limits, and unpredictable results are good enough for many demos. But being unable to tell whether a result is complete or not is a big constraint in many applications.

best,
Andrea

On 19 Apr 2013, at 02:54, Kingsley Idehen <[email protected]> wrote:

> On 4/18/13 7:06 PM, Andrea Splendiani wrote:
>> On 18 Apr 2013, at 16:04, Kingsley Idehen <[email protected]> wrote:
>>
>>> On 4/18/13 9:23 AM, Andrea Splendiani wrote:
>>>> Hi,
>>>>
>>>> I think that some caching with a minimum of query rewriting would get rid of 90% of the SELECT ?s ?p ?o WHERE { ?s ?p ?o } queries.
>>> Sorta.
>>> Client queries are inherently unpredictable. That has always been the case, and it predates SPARQL. These issues also exist in the SQL RDBMS realm, which is why you don't have SQL endpoints delivering what SPARQL endpoints provide.
>> I know, but I suspect that these days a lot of these "intensive" queries are explorative, just to check what is in the dataset, and they may end up being very similar in structure.
>
> Note, we have logs and recordings of queries that hit many of our public endpoints. For instance, we are preparing a report on DBpedia that will actually shed light on the types and complexity of queries that hit the DBpedia endpoint.
>
>> Jerven: can you report on your experience with this? How many of the problematic queries are not really targeted, but more generic?
>>
>>>> From a user perspective, I would rather have a clear result code upfront telling me "your query is too heavy", "not enough resources", and so on, than partial results plus extra codes.
>>> Yes, and you get that in some solutions, e.g., what we provide. Basically, our server (subject to capacity) will tell you immediately that your query exceeds the query cost limits (this is different from timeout limits). The aforementioned feature was critical to getting the DBpedia SPARQL endpoint going, years ago.
>> Can you make a precise estimate of the query cost, or do you rely on some heuristics?
>
> We have a query cost optimizer. It handles native and distributed queries. Of course, query optimization is a universe unto itself, but over the years we've continually applied what we've learned about queries to its continued evolution.
>
>>>> I won't do much with partial results anyway... so it's time wasted on both sides.
>>> Not in a world where you have a public endpoint and zero control over the queries issued by clients.
>>> Not in a world where you have to provide faceted navigation over entity relations as part of a "precision find" style service atop RDF-based Linked Data, etc.
>> I mean, partial results are OK if I have control over which part it is... a system-dependent random subset of results is not very useful (not even for statistics!)
>
> You have control with our system, because you are basically given the ability to retry using a heuristic that increases the total processing time per retry; at the same time, while you are making up your mind about whether to retry or not, there are background activities running in relation to your last query.
> Remember, query processing is comprised of many parts:
>
> 1. parsing
> 2. costing
> 3. solution
> 4. actual retrieval
>
> Many see 1-4 as a monolith. Not so when dealing with DBMS processing. Again, this is not novel; it's quite old in the RDBMS realm.
>
>>>> One empiric solution could be to assign a quota per requesting IP (or other form of identification).
>>> That's but one coarse-grained factor. You need to be able to associate a user agent (human or machine) profile with whatever quality of service you seek to scope to said profile. Again, this is the kind of thing we offer by leveraging WebID, inference, and RDF right inside the core DBMS engine.
>> I agree. The finer the better. The IP-based approach is perhaps relatively easy to implement if not much is provided by the system.
>>
>>>> Then one could restrict the total amount of resources per time-frame, possibly with smart policies.
>>> "Smart policies" are the kind of thing you produce by exploiting the kind of entity relationship semantics baked into RDF-based Linked Data. Basically, OWL (which is all about describing entity-type and relation-type semantics) serves this purpose very well. We certainly put it to use in our data access policy system, which enables us to offer different capabilities and resource consumption to different human- or machine-agent profiles.
>> How do you use OWL for this?
>
> We just have normal RDF graphs that describe data access policies. All you need is a Linked Data URI that denotes an agent (human or machine), an agent-profile-oriented ontology (e.g., FOAF and the like) that defines entity types and relation types, a Web access control ontology, actual entity relations based on the aforementioned ontologies, and reasoning capability. Basically, it is using RDF to do what it's actually designed to do.
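For concreteness, a policy graph of the kind Kingsley describes could be as simple as this sketch, loaded via SPARQL Update and using the W3C Basic Access Control vocabulary (the policy, WebID, and dataset URIs are made-up examples, not OpenLink's actual setup):

PREFIX acl: <http://www.w3.org/ns/auth/acl#>

INSERT DATA {
  GRAPH <https://example.org/policies> {
    # grant read access to one agent, identified by a WebID (example URI)
    <https://example.org/policies#readPolicy>
        a            acl:Authorization ;
        acl:agent    <https://example.org/people/alice#me> ;
        acl:accessTo <https://example.org/dataset> ;
        acl:mode     acl:Read .
  }
}

With reasoning over agent profiles (e.g., class membership via acl:agentClass and FOAF), the same pattern extends from single WebIDs to whole classes of human or machine agents.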
>>>> It would also avoid people breaking big queries into many small ones...
>>> You can't avoid bad or challenging queries. What you can do is look to fine-grained data access policies (semantically enhanced ACLs) to address this problem. This has always been the challenge, even before the emergence of the whole Semantic Web, RDF, etc. The same challenges also dogged the RDBMS realm. There is no dancing around this matter when dealing with traditional RDBMS or Web-oriented data access.
>>>> But I was wondering: why is resource consumption a problem for SPARQL endpoint providers, and not for other "providers" on the web (say, YouTube, Google, ...)?
>>>> Is it the unpredictability of the resources needed?
>>> Good question!
>>>
>>> They hide the problem behind airport-sized data centers, and then they get you to foot the bill via your profile data, which ultimately compromises your privacy.
>> Isn't the same possible with SPARQL, in principle?
>
> Sorta.
>
> The ultimate question, in our opinion, is this: which setup provides the most cost-effective solution for Linked Data exploitation at Web scale? Basically, how many machines do you need to provide acceptable performance to a variety of user and agent profiles? We don't believe you need an airport-sized data center to pull that off. The LOD cloud cache is just a 12-node Virtuoso cluster split across 4 machines.
>
> "OpenLink Virtuoso version 07.00.3202, on Linux (x86_64-unknown-linux-gnu), Cluster Edition (12 server processes, 756 GB total memory)"
>
> That's in the footer of the system home page: http://lod.openlinksw.com.
>
> Likewise, we expose timing and resource utilization data per query processed via that interface.
>
>> Although, I guess if a company knew that you spy on their queries... there would be some issue (unlike for users and Facebook, for some reason).
>
> It works like this:
>
> 1. we put out a public endpoint
> 2. we allow the public certain kinds of access, e.g., per the DBpedia fair use policy we have in place
> 3. we can provide special access to specific agents based on data access policy graphs scoped to their WebIDs or other types of identifiers.
>
> Kingsley
>
>> best,
>> Andrea
>>
>>> This is a problem, and it's ultimately the basis for showcasing what RDF (an entity relationship based data model endowed with *explicit* rather than *implicit* human- and machine-readable entity relationship semantics) is actually all about.
>>>
>>> Kingsley
>>>> best,
>>>> Andrea
>>>>
>>>> On 18 Apr 2013, at 12:53, Jerven Bolleman <[email protected]> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Managing a public SPARQL endpoint has some difficulties in comparison to managing a simpler REST API. Instead of counting API calls or external bandwidth use, we need to look at internal IO and CPU usage as well.
>>>>>
>>>>> Many of the current public SPARQL endpoints limit all their users to queries of limited CPU time. But this is not enough to really manage (mis)use of an endpoint. Also, the SPARQL API, being HTTP based, suffers from the problem that we first send the status code and may only find out later that we can't answer the query after all, leading to a "200 not OK" problem :(
>>>>>
>>>>> What approaches can we come up with as a community to embed resource-limit-exceeded exceptions in the SPARQL protocols? E.g., we could add an exception element to the SPARQL XML result format. [1]
>>>>>
>>>>> The current limits on CPU use are not enough to really avoid misuse, which is why I submitted a patch to Sesame that allows limits on memory use as well. Limits on disk seeks or other IO counts may be needed by some as well.
>>>>>
>>>>> But these are currently hard limits; what I really want is "playground limits", i.e., you can use the swing as much as you want if you are the only child in the park. Once there are more children, you have to share.
>>>>>
>>>>> And how do we communicate this to our users? I.e., "this result set is incomplete because you exceeded your IO quota; please break up your queries into smaller blocks."
>>>>>
>>>>> For my day job, where I manage a 7.4 billion triple store with public access, some extra tools for managing users would be great.
>>>>>
>>>>> Last but not least, how can we avoid users needing to run SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE { ?s ?p ?o } and friends? For beta.sparql.uniprot.org I have been moving much of this information into the SPARQL endpoint description, but it's not a place where people look for this information.
>>>>>
>>>>> Regards,
>>>>> Jerven
>>>>>
>>>>> [1] Yeah, these ideas are not great timing just after 1.1, but we can always start SPARQL 1.2 ;)
>>>>>
>>>>> -------------------------------------------------------------------
>>>>> Jerven Bolleman                        [email protected]
>>>>> SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
>>>>> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>>>>> 1211 Geneve 4, Switzerland             www.isb-sib.ch - www.uniprot.org
>>>>> Follow us at https://twitter.com/#!/uniprot
>>>>> -------------------------------------------------------------------
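To make the exception-element idea in Jerven's message concrete: a truncated result document might carry a marker along these lines (the element name, attribute, and placement are hypothetical, not part of any SPARQL specification):

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="s"/>
  </head>
  <results>
    <result>
      <binding name="s"><uri>http://example.org/a</uri></binding>
    </result>
    <!-- hypothetical extension: signals a truncated result set -->
    <exception code="resource-limit-exceeded">
      IO quota exceeded; the result set is incomplete.
    </exception>
  </results>
</sparql>

Clients aware of the element could retry or page; older clients parsing only the result elements would simply ignore it.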
>>> --
>>>
>>> Regards,
>>>
>>> Kingsley Idehen
>>> Founder & CEO
>>> OpenLink Software
>>> Company Web: http://www.openlinksw.com
>>> Personal Weblog: http://www.openlinksw.com/blog/~kidehen
>>> Twitter/Identi.ca handle: @kidehen
>>> Google+ Profile: https://plus.google.com/112399767740508618350/about
>>> LinkedIn Profile: http://www.linkedin.com/in/kidehen
>
> --
>
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Company Web: http://www.openlinksw.com
> Personal Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca handle: @kidehen
> Google+ Profile: https://plus.google.com/112399767740508618350/about
> LinkedIn Profile: http://www.linkedin.com/in/kidehen
