[jira] [Comment Edited] (JENA-1926) Query execution speed depends more on WHERE clause order than expected

Julian Gonggrijp (Jira) Fri, 26 Jun 2020 15:55:39 -0700


    [ https://issues.apache.org/jira/browse/JENA-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146688#comment-17146688 ]


Julian Gonggrijp edited comment on JENA-1926 at 6/26/20, 10:54 PM:
-------------------------------------------------------------------

Thank you so much [~andy] and [~rvesse] for taking the time to explain this to 
me. I really appreciate it.

I will not debate that ARQ has a reason to behave in the way it does, but 
before I can let this rest, I need to be 100% clear on whether these queries 
are equivalent or not in a set-theoretic sense. If they are, then I think it is 
somewhat problematic that there is a 3x performance difference between them. If 
they are not, then I need to understand which version is correct for my 
application.

??The reason they can be non-equivalent is because in the two versions 
{{?body}} may be bound to a different set of values.??

I can see, given the nesting of the graph patterns and the specifics of how ARQ 
operates, that this may be true _at the time of the {{OPTIONAL}} pattern 
evaluation_. What I still cannot wrap my head around, however, is how {{?body}} 
may be bound to a different set of values _by the time the entire query has 
been evaluated_.

Sure, in the fast version, the {{OPTIONAL}} pattern is tried against any 
{{?body}} that matches just the {{?annotation oa:hasBody ?body}} triple 
pattern. Initially, this will generate {{?body ?c ?d}} triples that wouldn't be 
included in the slow version. But surely, these superfluous triples will be 
filtered out again as ARQ joins the result of the {{conditional}} with the 
larger {{bgp}}, as this constrains the set of possible values for 
{{?annotation}} in the same way as in the slow version?
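
To make the question concrete, here is a minimal sketch in Python (not ARQ 
code) of the two evaluation orders, using textbook join and left-join 
operators over solution mappings. The triples, the names {{anno1}}, 
{{body1}} and so on, and the reduction of the query to three patterns are all 
made up for illustration. It only shows that on this toy data the extra 
{{?body ?c ?d}} rows disappear after the final join; it proves nothing about 
the general case or about how ARQ executes the query internally.

```python
# Toy model: a solution is a dict mapping variable names to values.
# All data below is hypothetical and only serves to illustrate the question.

TRIPLES = [
    ("anno1", "oa:hasBody", "body1"),
    ("anno1", "oa:hasTarget", "target1"),
    ("body1", "rdfs:label", "x"),
    ("anno2", "oa:hasBody", "body2"),   # anno2 has no target
    ("body2", "rdfs:label", "y"),
    ("anno3", "oa:hasBody", "body3"),   # body3 has no triples of its own
    ("anno3", "oa:hasTarget", "target3"),
]


def match(triples, pattern):
    """Match one triple pattern; terms starting with '?' are variables."""
    solutions = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            solutions.append(binding)
    return solutions


def compatible(m1, m2):
    """Two solutions are compatible if they agree on all shared variables."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def join(left, right):
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


def left_join(left, right):
    """Keep every left solution; extend it where the optional side matches."""
    result = []
    for l in left:
        extended = [{**l, **r} for r in right if compatible(l, r)]
        result.extend(extended if extended else [l])
    return result


def canonical(solutions):
    """Order-independent form so two result lists can be compared."""
    return sorted(tuple(sorted(s.items())) for s in solutions)


has_body = match(TRIPLES, ("?annotation", "oa:hasBody", "?body"))
has_target = match(TRIPLES, ("?annotation", "oa:hasTarget", "?target"))
body_props = match(TRIPLES, ("?body", "?c", "?d"))

# "Slow" order: join the required patterns first, OPTIONAL last.
slow = left_join(join(has_body, has_target), body_props)

# "Fast" order: OPTIONAL right after the first pattern, then the rest.
intermediate = left_join(has_body, body_props)
fast = join(intermediate, has_target)

print(len(intermediate), len(fast), canonical(slow) == canonical(fast))
```

In this sketch the intermediate result of the fast order contains a row for 
{{anno2}}, which has a body but no target; that row is the kind of 
superfluous match I mean, and the join with the remaining pattern drops it.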


> Query execution speed depends more on WHERE clause order than expected
> ----------------------------------------------------------------------
>
>                 Key: JENA-1926
>                 URL: https://issues.apache.org/jira/browse/JENA-1926
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.15.0
>            Reporter: Julian Gonggrijp
>            Priority: Minor
>
> The following query takes about 6.5 seconds with my dataset, which 
> unfortunately I cannot share. Note that {{?source}} is bound to a single IRI 
> in all queries below; I'm leaving that out for brevity.
> {code:java}
> PREFIX oa: <http://www.w3.org/ns/oa#>
> PREFIX dcterms: <http://purl.org/dc/terms/>
> CONSTRUCT {
>     ?annotation ?a ?b.
>     ?body ?c ?d.
>     ?target ?e ?f.
>     ?selector ?g ?h.
> } WHERE {
>     ?annotation oa:hasBody ?body;
>                 oa:hasTarget ?target;
>                 dcterms:creator ?user;
>                 ?a ?b.
>     ?target oa:hasSource ?source;
>             oa:hasSelector ?selector;
>             ?e ?f.
>     ?selector ?g ?h.
>     OPTIONAL { ?body ?c ?d }.
> }
> {code}
> Compare this to the following query, which I believe is exactly equivalent 
> but takes only 2 seconds:
> {code:java}
> PREFIX oa: <http://www.w3.org/ns/oa#>
> PREFIX dcterms: <http://purl.org/dc/terms/>
> CONSTRUCT {
>     ?annotation ?a ?b.
>     ?body ?c ?d.
>     ?target ?e ?f.
>     ?selector ?g ?h.
> } WHERE {
>     ?annotation oa:hasBody ?body.
>     OPTIONAL { ?body ?c ?d }.
>     ?annotation oa:hasTarget ?target;
>                 dcterms:creator ?user;
>                 ?a ?b.
>     ?target oa:hasSource ?source;
>             oa:hasSelector ?selector;
>             ?e ?f.
>     ?selector ?g ?h.
> }
> {code}
> For comparison, leaving out the optional {{?body}} entirely, I get a query 
> that executes in 1.7 seconds:
> {code:java}
> PREFIX oa: <http://www.w3.org/ns/oa#>
> PREFIX dcterms: <http://purl.org/dc/terms/>
> CONSTRUCT {
>     ?annotation ?a ?b.
>     ?target ?e ?f.
>     ?selector ?g ?h.
> } WHERE {
>     ?annotation oa:hasTarget ?target;
>                 dcterms:creator ?user;
>                 ?a ?b.
>     ?target oa:hasSource ?source;
>             oa:hasSelector ?selector;
>             ?e ?f.
>     ?selector ?g ?h.
> }
> {code}
> I'm a novice to SPARQL, but coming from SQL, I wouldn't expect query 
> execution speed to depend so much on the order in which the criteria are 
> given.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
