On 18 Apr 2013, at 16:04, Kingsley Idehen <[email protected]>
wrote:
> On 4/18/13 9:23 AM, Andrea Splendiani wrote:
>> Hi,
>>
>> I think that some caching with a minimum of query rewriting would get rid
>> of 90% of the SELECT ?s ?p ?o WHERE {?s ?p ?o} queries.
> Sorta.
> Client queries are inherently unpredictable. That's always been the case, and
> that predates SPARQL. These issues also exist in the SQL RDBMS realm, which
> is why you don't have SQL endpoints delivering what SPARQL endpoints provide.
I know, but I suspect that these days a lot of these "intensive" queries are
exploratory, issued just to check what is in the dataset, and they may end up
being very similar in structure.
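If that hunch is right, even a naive normalize-and-hash cache in front of the endpoint would catch the repeats. A minimal sketch in Python; the regexes are illustrative stand-ins for a real SPARQL parser, which you'd want in practice:

```python
import hashlib
import re

def cache_key(query: str) -> str:
    """Build a cache key for a SPARQL query by collapsing whitespace and
    lowercasing keywords, so cosmetically different copies of the same
    exploratory query hit the same cache entry.
    (Toy normalization: a production version needs a proper parser.)"""
    normalized = re.sub(r"\s+", " ", query.strip())
    # Lowercase SPARQL keywords but leave variable names and IRIs alone.
    normalized = re.sub(
        r"\b(SELECT|WHERE|DISTINCT|LIMIT|OFFSET|FILTER|OPTIONAL)\b",
        lambda m: m.group(0).lower(),
        normalized,
        flags=re.IGNORECASE,
    )
    return hashlib.sha256(normalized.encode()).hexdigest()

# Two cosmetically different copies of the classic exploratory query
# map to the same key, so the second can be served from cache.
a = cache_key("SELECT ?s ?p ?o WHERE { ?s ?p ?o }")
b = cache_key("select ?s ?p ?o\nwhere   { ?s ?p ?o }")
assert a == b
```

This only catches textually-near-identical queries, which is exactly the "everyone runs the same explore-the-dataset query" case; structurally equivalent but differently-written queries would need algebra-level canonicalization.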
Jerven: can you report on your experience with this? How many of the
problematic queries are not really targeted, but more generic?
>> From a user perspective, I would rather have a clear result code upfront
>> telling me "your query is too heavy", "not enough resources", and so on,
>> than partial results plus extra codes.
> Yes, and you get that in some solutions e.g., what we provide. Basically, our
> server (subject to capacity) will tell you immediately that your query
> exceeds the query cost limits (this is different from timeout limits). The
> aforementioned feature was critical to getting the DBpedia SPARQL endpoint
> going, years ago.
Can you make a precise estimate of the query cost, or do you rely on
heuristics?
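For illustration, even a crude pre-execution heuristic can separate the "dump everything" queries from bounded ones. A toy scorer in Python (my own strawman, not Virtuoso's actual cost model; the regexes are simplistic on purpose):

```python
import re

def heuristic_cost(query: str) -> int:
    """Rough pre-execution cost score for a SPARQL query.
    Strawman heuristic: fully unconstrained triple patterns and a
    missing LIMIT both push the score up."""
    # Crude extraction of the WHERE body; a real estimator would
    # work on the parsed query algebra, not regexes.
    m = re.search(r"\{(.*)\}", query, re.DOTALL)
    body = m.group(1) if m else query
    # Triple patterns where subject, predicate AND object are variables.
    unconstrained = len(re.findall(r"\?\w+\s+\?\w+\s+\?\w+", body))
    has_limit = bool(re.search(r"\bLIMIT\s+\d+", query, re.IGNORECASE))
    cost = 10 * max(1, unconstrained)
    if not has_limit:
        cost *= 5  # unbounded result sets are the expensive case
    return cost
```

A server could reject anything over a configured threshold with an immediate error code, before spending any IO, which is exactly the "clear result code upfront" behaviour asked for above.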
>> I won't make much use of partial results anyway... so it's time wasted on
>> both sides.
> Not in a world where you have a public endpoint and zero control over the
> queries issued by clients.
> Not in a world where you have to provide faceted navigation over entity
> relations as part of a "precision find" style service atop RDF-based Linked
> Data etc..
I mean, partial results are OK if I have control over which part I get... a
system-dependent random subset of the results is not very useful (not even
for statistics!)
>> One empiric solution could be to assign a quota per requesting IP (or other
>> form of identification).
> That's but one coarse-grained factor. You need to be able to associate a user
> agent (human or machine) profile with whatever quality of service you seek
> to scope to said profile. Again, this is the kind of thing we offer by
> leveraging WebID, Inference, and RDF right inside the core DBMS engine.
I agree. The finer the better. The IP-based approach is perhaps relatively
easy to implement if the system provides little built-in support.
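A per-IP token bucket is about the simplest version of such a quota. A sketch in Python (capacities and rates are illustrative, not a recommendation):

```python
import time
from collections import defaultdict

class IpQuota:
    """Per-IP token bucket: each address gets `capacity` query credits
    that refill at `rate` credits per second. Coarse-grained, as noted
    above, but cheap to bolt on in front of any endpoint."""

    def __init__(self, capacity: float = 10.0, rate: float = 0.5):
        self.capacity = capacity
        self.rate = rate
        # Each IP starts with a full bucket, stamped at first sight.
        self.buckets = defaultdict(lambda: (capacity, time.monotonic()))

    def allow(self, ip: str, cost: float = 1.0) -> bool:
        tokens, last = self.buckets[ip]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= cost:
            self.buckets[ip] = (tokens - cost, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False

quota = IpQuota(capacity=2, rate=0.0)  # two free queries, no refill
assert quota.allow("192.0.2.1")
assert quota.allow("192.0.2.1")
assert not quota.allow("192.0.2.1")  # third query is rejected
assert quota.allow("192.0.2.2")      # other clients are unaffected
```

Passing a per-query cost estimate as `cost` (instead of a flat 1.0) would let the same bucket charge heavy queries more than light ones.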
>
>> Then one could restrict the total amount of resource per time-frame,
>> possibly with smart policies.
> "Smart Policies" are the kind of thing you produce by exploiting the kind or
> entity relationship semantics baked into RDF based Linked Data. Basically,
> OWL (which is all about describing entity types and relation types semantics)
> serves this purpose very well. We certainly put it to use in our data access
> policy system which enables us to offer different capabilities and resource
> consumption to different human- or machine-agent profiles.
How do you use OWL for this ?
>
>> It would also avoid people breaking big queries in many small ones...
> You can't avoid bad or challenging queries. What you can do is look to
> fine-grained data access policies (semantically enhanced ACLs) to address
> this problem. This has always been the challenge, even before the emergence
> of the whole Semantic Web, RDF, etc. The same challenges also dogged the
> RDBMS realm. There is no dancing around this matter when dealing with
> traditional RDBMS- or Web-oriented data access.
>>
>> But I was wondering: why is resource consumption a problem for SPARQL
>> endpoint providers, and not for other "providers" on the web (say,
>> YouTube, Google, ...)?
>> Is it the unpredictability of the resources needed?
>
> Good question!
>
> They hide the problem behind airport sized data centers, and then they get
> you to foot the bill via your profile data which ultimately compromises your
> privacy.
Isn't the same possible with SPARQL, in principle?
Although, I guess if a company knew that you were spying on their queries...
there would be some issues (unlike for users and Facebook, for some reason).
best,
Andrea
>
> This is a problem, and it's ultimately the basis for showcasing what RDF
> (an entity relationship based data model endowed with *explicit* rather than
> *implicit* human- and machine-readable entity relationship semantics) is
> actually all about.
>
>
> Kingsley
>>
>> best,
>> Andrea
>>
>> On 18 Apr 2013, at 12:53, Jerven Bolleman
>> <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> Managing a public SPARQL endpoint has some difficulties in comparison to
>>> managing a simpler REST API.
>>> Instead of counting API calls or external bandwidth use, we need to look
>>> at internal IO and CPU usage as well.
>>>
>>> Many of the current public SPARQL endpoints limit all their users to
>>> queries of limited CPU time.
>>> But this is not enough to really manage (mis)use of an endpoint. The
>>> SPARQL API, being HTTP based, also suffers from the problem that we first
>>> send the status code and may only find out later that we can't answer the
>>> query after all, leading to a "200 not OK" problem :(
>>>
>>> What approaches can we come up with as a community to embed
>>> resource-limit-exceeded exceptions in the SPARQL protocols? E.g. we could
>>> add an exception element to the SPARQL XML result format. [1]
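To make that concrete, here is one way such a document could look, built with Python's stdlib XML module. The `<exception>` element is purely a strawman extension, not part of the SPARQL 1.1 Query Results XML Format; its name and `code` attribute are made up for illustration:

```python
import xml.etree.ElementTree as ET

SRX = "http://www.w3.org/2005/sparql-results#"
ET.register_namespace("", SRX)

def partial_result_doc(subject_uris, reason):
    """Sketch of a SPARQL XML results document carrying the bindings
    computed so far, plus a hypothetical non-standard <exception>
    element explaining why the result set is partial."""
    root = ET.Element(f"{{{SRX}}}sparql")
    head = ET.SubElement(root, f"{{{SRX}}}head")
    ET.SubElement(head, f"{{{SRX}}}variable", name="s")
    results = ET.SubElement(root, f"{{{SRX}}}results")
    for value in subject_uris:
        result = ET.SubElement(results, f"{{{SRX}}}result")
        binding = ET.SubElement(result, f"{{{SRX}}}binding", name="s")
        uri = ET.SubElement(binding, f"{{{SRX}}}uri")
        uri.text = value
    # The extension point: NOT in any SPARQL spec, purely a strawman.
    exc = ET.SubElement(root, f"{{{SRX}}}exception")
    exc.set("code", "resource-limit-exceeded")
    exc.text = reason
    return ET.tostring(root, encoding="unicode")

doc = partial_result_doc(["http://example.org/a"], "IO quota exceeded")
```

Because the element trails the `<results>` block, a streaming server can emit bindings as they are produced and still append the exception when the quota trips mid-query, which is exactly the "200 not OK" situation described above.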
>>>
>>> The current limits on CPU use are not enough to really avoid misuse,
>>> which is why I submitted a patch to Sesame that allows limits on memory
>>> use as well, although limits on disk seeks or other IO counts may be
>>> needed by some as well.
>>>
>>> But these are currently hard limits; what I really want is
>>> "playground limits", i.e. you can use the swing as much as you want if
>>> you are the only child in the park.
>>> Once there are more children, you have to share.
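One simple reading of "playground limits" is to divide a fixed resource budget among whoever is active right now, with a guaranteed minimum slice. A sketch (all numbers illustrative, not from any real endpoint):

```python
def fair_share_timeout(active_clients: int,
                       total_budget: float = 120.0,
                       floor: float = 5.0) -> float:
    """Divide a fixed per-interval CPU budget (seconds) among the
    currently active clients, never dropping below a minimal
    guaranteed slice. The lone child gets the whole swing."""
    active = max(1, active_clients)
    return max(floor, total_budget / active)

assert fair_share_timeout(1) == 120.0   # alone in the park: full budget
assert fair_share_timeout(4) == 30.0    # more children: share
assert fair_share_timeout(100) == 5.0   # but nobody is starved entirely
```

A real implementation would recompute this as clients arrive and leave, and would likely weight shares by an agent profile (as discussed earlier in the thread) rather than treating all clients equally.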
>>>
>>> And how do we communicate this to our users? I.e. "this result set is
>>> incomplete because you exceeded your IO quota; please break up your
>>> queries into smaller blocks."
>>>
>>> For my day job, where I manage a 7.4 billion triple store with public
>>> access, some extra tools for managing users would be
>>> great.
>>>
>>> Last but not least, how can we avoid users needing to run SELECT
>>> (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
>>> For beta.sparql.uniprot.org I have been moving much of this information
>>> into the SPARQL endpoint service description, but it's not a place
>>> where people look for this information.
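Another place to advertise those numbers is a VoID description, which already defines properties for exactly these counts. A sketch that emits one (the figures are placeholders, not UniProt's real statistics):

```python
def void_statistics(dataset_uri: str, triples: int,
                    distinct_subjects: int, distinct_objects: int) -> str:
    """Emit a minimal VoID description advertising dataset statistics,
    so clients can read the counts instead of running
    SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} themselves.
    void:triples, void:distinctSubjects and void:distinctObjects are
    standard VoID vocabulary terms."""
    return f"""@prefix void: <http://rdfs.org/ns/void#> .

<{dataset_uri}> a void:Dataset ;
    void:triples {triples} ;
    void:distinctSubjects {distinct_subjects} ;
    void:distinctObjects {distinct_objects} .
"""

# Placeholder figures for a 7.4 billion triple dataset.
doc = void_statistics("http://example.org/void#ds",
                      7_400_000_000, 500_000_000, 900_000_000)
```

The endpoint can regenerate this cheaply from internal statistics after each load, and a well-known document is arguably easier for users to discover than the service description.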
>>>
>>> Regards,
>>> Jerven
>>>
>>> [1] Yeah, these ideas are not great timing just after SPARQL 1.1, but we
>>> can always start SPARQL 1.2 ;)
>>>
>>>
>>>
>>> -------------------------------------------------------------------
>>> Jerven Bolleman [email protected]
>>> SIB Swiss Institute of Bioinformatics Tel: +41 (0)22 379 58 85
>>> CMU, rue Michel Servet 1 Fax: +41 (0)22 379 58 58
>>> 1211 Geneve 4,
>>> Switzerland www.isb-sib.ch - www.uniprot.org
>>> Follow us at https://twitter.com/#!/uniprot
>>> -------------------------------------------------------------------
> --
>
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Company Web: http://www.openlinksw.com
> Personal Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca handle: @kidehen
> Google+ Profile: https://plus.google.com/112399767740508618350/about
> LinkedIn Profile: http://www.linkedin.com/in/kidehen