Hi, let me get back to this thread for two reasons. 1) I was wondering whether the report on DBpedia queries cited below has already been published. 2) I have recently tried to use DBpedia for some simple computation and I have run into a problem. Basically, a query for all cities whose population is larger than that of the countries they are in returns a random number of results. I suspect this is due to hitting some internal computation load limits, and there is not much I can do about limits, I think, as the results are no more than 20 or so.
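The query has roughly this shape (a sketch; the dbo: class and property names are my best guess and may not be exactly right):

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?city ?cityPop ?country ?countryPop
WHERE {
  ?city a dbo:City ;
        dbo:country ?country ;
        dbo:populationTotal ?cityPop .
  ?country dbo:populationTotal ?countryPop .
  # keep only cities reported as more populous than their own country
  FILTER (?cityPop > ?countryPop)
}

(Comparing the rows actually returned against SELECT (COUNT(*) AS ?n) over the same pattern might reveal truncation, but presumably the count itself runs under the same limits.)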
Now, I discovered this by chance. If it is due to some limit, I would much prefer an error message ("query too expensive") to partial results. Is there a way to detect that these results are partial? Otherwise, a whole range of use cases becomes problematic. I know DBpedia is a best-effort free resource, so I understand the need for limits, and unpredictable results are good enough for many demos. But being unable to tell whether a result is complete or not is a big constraint in many applications.

best,
Andrea

On 19 Apr 2013, at 02:54, Kingsley Idehen <[email protected]> wrote:

> On 4/18/13 7:06 PM, Andrea Splendiani wrote:
>> On 18 Apr 2013, at 16:04, Kingsley Idehen <[email protected]> wrote:
>>
>>> On 4/18/13 9:23 AM, Andrea Splendiani wrote:
>>>> Hi,
>>>>
>>>> I think that some caching with a minimum of query rewriting would get rid of 90% of the SELECT ?s ?p ?o WHERE { ?s ?p ?o } queries.
>>> Sorta.
>>> Client queries are inherently unpredictable. That has always been the case, and it predates SPARQL. These issues also exist in the SQL RDBMS realm, which is why you don't have SQL endpoints delivering what SPARQL endpoints provide.
>> I know, but I suspect that these days a lot of these "intensive" queries are explorative, just to check what is in the dataset, and they may end up being very similar in structure.
>
> Note, we have logs and recordings of queries that hit many of our public endpoints. For instance, we are preparing a report on DBpedia that will actually shed light on the types and complexity of queries that hit the DBpedia endpoint.
>
>> Jerven: can you report on your experience with this? How many of the problematic queries are not really targeted, but more generic?
>>
>>>> From a user perspective, I would rather have a clear result code upfront telling me "your query is too heavy", "not enough resources", and so on, than partial results plus extra codes.
>>> Yes, and you get that in some solutions, e.g., what we provide. Basically, our server (subject to capacity) will tell you immediately that your query exceeds the query cost limits (this is different from timeout limits). The aforementioned feature was critical to getting the DBpedia SPARQL endpoint going, years ago.
>> Can you make a precise estimate of the query cost, or do you rely on some heuristics?
>
> We have a query cost optimizer. It handles native and distributed queries. Of course, query optimization is a universe unto itself, but over the years we've continually applied what we've learned about queries to its continued evolution.
>
>>>> I won't do much with partial results anyway... so it's time wasted on both sides.
>>> Not in a world where you have a public endpoint and zero control over the queries issued by clients.
>>> Not in a world where you have to provide faceted navigation over entity relations as part of a "precision find" style service atop RDF-based Linked Data, etc.
>> I mean, partial results are OK if I have control over which part it is... a system-dependent random subset of results is not very useful (not even for statistics!)
>
> You have control with our system, because you are basically given the ability to retry using a heuristic that increases the total processing time per retry; at the same time, while you are making up your mind about whether to retry or not, there are background activities running in relation to your last query.
> Remember, query processing is comprised of many parts:
>
> 1. parsing
> 2. costing
> 3. solution
> 4. actual retrieval
>
> Many see 1-4 as a monolith. Not so when dealing with DBMS processing. Again, this is not novel; it's quite old in the RDBMS realm.
>
>>>> One empiric solution could be to assign a quota per requesting IP (or other form of identification).
>>> That's but one coarse-grained factor. You need to be able to associate a user agent (human or machine) profile with whatever quality of service you seek to scope to said profile. Again, this is the kind of thing we offer by leveraging WebID, inference, and RDF right inside the core DBMS engine.
>> I agree. The finer the better. The IP-based approach is perhaps relatively easy to implement if not much is provided by the system.
>>
>>>> Then one could restrict the total amount of resources per time-frame, possibly with smart policies.
>>> "Smart policies" are the kind of thing you produce by exploiting the kind of entity relationship semantics baked into RDF-based Linked Data. Basically, OWL (which is all about describing entity-type and relation-type semantics) serves this purpose very well. We certainly put it to use in our data access policy system, which enables us to offer different capabilities and resource consumption to different human- or machine-agent profiles.
>> How do you use OWL for this?
>
> We just have normal RDF graphs that describe data access policies. All you need is a Linked Data URI that denotes an agent (human or machine), an agent-profile-oriented ontology (e.g., FOAF and the like) that defines entity types and relation types, a Web access control ontology, actual entity relations based on the aforementioned ontologies, and reasoning capability. Basically, it is using RDF to do what it's actually designed to do.
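For concreteness, a policy graph of the kind Kingsley describes could be as simple as this sketch, loaded via SPARQL Update and using the W3C Basic Access Control vocabulary (the policy, WebID, and dataset URIs are made-up examples, not OpenLink's actual setup):

PREFIX acl: <http://www.w3.org/ns/auth/acl#>

INSERT DATA {
  GRAPH <https://example.org/policies> {
    # grant read access to one agent, identified by a WebID (example URI)
    <https://example.org/policies#readPolicy>
        a            acl:Authorization ;
        acl:agent    <https://example.org/people/alice#me> ;
        acl:accessTo <https://example.org/dataset> ;
        acl:mode     acl:Read .
  }
}

With reasoning over agent profiles (e.g., class membership via acl:agentClass and FOAF), the same pattern extends from single WebIDs to whole classes of human or machine agents.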
>>>> It would also avoid people breaking big queries into many small ones...
>>> You can't avoid bad or challenging queries. What you can do is look to fine-grained data access policies (semantically enhanced ACLs) to address this problem. This has always been the challenge, even before the emergence of the whole Semantic Web, RDF, etc. The same challenges also dogged the RDBMS realm. There is no dancing around this matter when dealing with traditional RDBMS or Web-oriented data access.
>>>> But I was wondering: why is resource consumption a problem for SPARQL endpoint providers, and not for other "providers" on the web (say, YouTube, Google, ...)?
>>>> Is it the unpredictability of the resources needed?
>>> Good question!
>>>
>>> They hide the problem behind airport-sized data centers, and then they get you to foot the bill via your profile data, which ultimately compromises your privacy.
>> Isn't the same possible with SPARQL, in principle?
>
> Sorta.
>
> The ultimate question, in our opinion, is this: which setup provides the most cost-effective solution for Linked Data exploitation at Web scale? Basically, how many machines do you need to provide acceptable performance to a variety of user and agent profiles? We don't believe you need an airport-sized data center to pull that off. The LOD cloud cache is just a 12-node Virtuoso cluster split across 4 machines.
>
> "OpenLink Virtuoso version 07.00.3202, on Linux (x86_64-unknown-linux-gnu), Cluster Edition (12 server processes, 756 GB total memory)"
>
> That's in the footer of the system home page: http://lod.openlinksw.com.
>
> Likewise, we expose timing and resource utilization data per query processed via that interface.
>
>> Although, I guess if a company knew that you spy on their queries... there would be some issue (unlike for users and Facebook, for some reason).
>
> It works like this:
>
> 1. we put out a public endpoint
> 2. we allow the public certain kinds of access, e.g., per the DBpedia fair use policy we have in place
> 3. we can provide special access to specific agents based on data access policy graphs scoped to their WebIDs or other types of identifiers.
>
> Kingsley
>
>> best,
>> Andrea
>>
>>> This is a problem, and it's ultimately the basis for showcasing what RDF (an entity relationship based data model endowed with *explicit* rather than *implicit* human- and machine-readable entity relationship semantics) is actually all about.
>>>
>>> Kingsley
>>>> best,
>>>> Andrea
>>>>
>>>> On 18 Apr 2013, at 12:53, Jerven Bolleman <[email protected]> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Managing a public SPARQL endpoint has some difficulties in comparison to managing a simpler REST API. Instead of counting API calls or external bandwidth use, we need to look at internal IO and CPU usage as well.
>>>>>
>>>>> Many of the current public SPARQL endpoints limit all their users to queries of limited CPU time. But this is not enough to really manage (mis)use of an endpoint. Also, the SPARQL API, being HTTP based, suffers from the problem that we first send the status code and may only find out later that we can't answer the query after all, leading to a "200 not OK" problem :(
>>>>>
>>>>> What approaches can we come up with as a community to embed resource-limit-exceeded exceptions in the SPARQL protocols? E.g., we could add an exception element to the SPARQL XML result format. [1]
>>>>>
>>>>> The current limits on CPU use are not enough to really avoid misuse, which is why I submitted a patch to Sesame that allows limits on memory use as well. Limits on disk seeks or other IO counts may be needed by some as well.
>>>>>
>>>>> But these are currently hard limits; what I really want is "playground limits", i.e., you can use the swing as much as you want if you are the only child in the park. Once there are more children, you have to share.
>>>>>
>>>>> And how do we communicate this to our users? I.e., "this result set is incomplete because you exceeded your IO quota; please break up your queries into smaller blocks."
>>>>>
>>>>> For my day job, where I manage a 7.4 billion triple store with public access, some extra tools for managing users would be great.
>>>>>
>>>>> Last but not least, how can we avoid users needing to run SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE { ?s ?p ?o } and friends? For beta.sparql.uniprot.org I have been moving much of this information into the SPARQL endpoint description, but it's not a place where people look for this information.
>>>>>
>>>>> Regards,
>>>>> Jerven
>>>>>
>>>>> [1] Yeah, these ideas are not great timing just after 1.1, but we can always start SPARQL 1.2 ;)
>>>>>
>>>>> -------------------------------------------------------------------
>>>>> Jerven Bolleman                        [email protected]
>>>>> SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
>>>>> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>>>>> 1211 Geneve 4, Switzerland             www.isb-sib.ch - www.uniprot.org
>>>>> Follow us at https://twitter.com/#!/uniprot
>>>>> -------------------------------------------------------------------
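To make the exception-element idea in Jerven's message concrete: a truncated result document might carry a marker along these lines (the element name, attribute, and placement are hypothetical, not part of any SPARQL specification):

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="s"/>
  </head>
  <results>
    <result>
      <binding name="s"><uri>http://example.org/a</uri></binding>
    </result>
    <!-- hypothetical extension: signals a truncated result set -->
    <exception code="resource-limit-exceeded">
      IO quota exceeded; the result set is incomplete.
    </exception>
  </results>
</sparql>

Clients aware of the element could retry or page; older clients parsing only the result elements would simply ignore it.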
>>> --
>>>
>>> Regards,
>>>
>>> Kingsley Idehen
>>> Founder & CEO
>>> OpenLink Software
>>> Company Web: http://www.openlinksw.com
>>> Personal Weblog: http://www.openlinksw.com/blog/~kidehen
>>> Twitter/Identi.ca handle: @kidehen
>>> Google+ Profile: https://plus.google.com/112399767740508618350/about
>>> LinkedIn Profile: http://www.linkedin.com/in/kidehen
>
> --
>
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Company Web: http://www.openlinksw.com
> Personal Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca handle: @kidehen
> Google+ Profile: https://plus.google.com/112399767740508618350/about
> LinkedIn Profile: http://www.linkedin.com/in/kidehen
