[jira] [Commented] (JENA-1926) Query execution speed depends more on WHERE clause order than expected

Rob Vesse (Jira) Fri, 26 Jun 2020 06:57:34 -0700


    [ 
https://issues.apache.org/jira/browse/JENA-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146317#comment-17146317
 ]


Rob Vesse commented on JENA-1926:
---------------------------------

[~jgonggrijp] It is usually useful to look at the compiled SPARQL algebra when 
discussing equivalence, because queries that look equivalent in SPARQL syntax 
often have subtle but important differences in the algebra.  You can inspect 
the algebra for a query like so ({{qparse}} is part of the Jena command line 
tools):

{noformat}
> qparse --explain --query <query-file>
{noformat}

For your slow query we get the following algebra:

{noformat}
(prefix ((dcterms: <http://purl.org/dc/terms/>)
         (oa: <http://www.w3.org/ns/oa#>))
  (conditional
    (bgp
      (triple ?annotation oa:hasBody ?body)
      (triple ?annotation oa:hasTarget ?target)
      (triple ?annotation dcterms:creator ?user)
      (triple ?annotation ?a ?b)
      (triple ?target oa:hasSource ?source)
      (triple ?target oa:hasSelector ?selector)
      (triple ?target ?e ?f)
      (triple ?selector ?g ?h)
    )
    (bgp (triple ?body ?c ?d))))
{noformat}

For the second (fast) query we get the following algebra:

{noformat}
(prefix ((dcterms: <http://purl.org/dc/terms/>)
         (oa: <http://www.w3.org/ns/oa#>))
  (sequence
    (conditional
      (bgp (triple ?annotation oa:hasBody ?body))
      (bgp (triple ?body ?c ?d)))
    (bgp
      (triple ?annotation oa:hasTarget ?target)
      (triple ?annotation dcterms:creator ?user)
      (triple ?annotation ?a ?b)
      (triple ?target oa:hasSource ?source)
      (triple ?target oa:hasSelector ?selector)
      (triple ?target ?e ?f)
      (triple ?selector ?g ?h)
    )))
{noformat}

These are structurally quite different, as [~andy] pointed out: the joins are 
done in quite different orders.

The reason they can be non-equivalent is that {{?body}} may be bound to a 
different set of values in the two versions.  In the fast version you are 
considering every possible {{?annotation oa:hasBody ?body}} triple in isolation, 
whereas in the slow version you are considering only the {{?body}} values 
produced by a much larger graph pattern.  It may be that for this particular 
query and your data the two queries are equivalent, but that does not 
necessarily hold in the general case.
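
To make the general case concrete, here is a minimal sketch in Python of a simplified join and left join over solution mappings. It is not ARQ code, and the patterns and data are hypothetical: {{A}} stands for a pattern like {{?x :p ?y}}, {{B}} for {{OPTIONAL { ?y :q ?z }}} and {{C}} for a later pattern {{?x :r ?z}} that shares {{?z}} with the optional part. Filters are ignored.

```python
def compatible(m1, m2):
    """Two solution mappings are compatible if they agree on every shared variable."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def join(left, right):
    """Simplified SPARQL Join: merge every compatible pair of mappings."""
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


def left_join(left, right):
    """Simplified SPARQL LeftJoin (OPTIONAL): keep unmatched left rows as they are."""
    out = []
    for l in left:
        matches = [{**l, **r} for r in right if compatible(l, r)]
        out.extend(matches if matches else [l])
    return out


# Hypothetical solutions for each pattern (assumed data, not from the issue).
A = [{"x": 1, "y": 10}]    # ?x :p ?y
B = [{"y": 10, "z": "b"}]  # OPTIONAL { ?y :q ?z }
C = [{"x": 1, "z": "c"}]   # ?x :r ?z

# OPTIONAL written last: join everything else first, then apply the left join.
optional_last = left_join(join(A, C), B)

# OPTIONAL written early: left join first, then join with the remaining pattern.
optional_early = join(left_join(A, B), C)

print(optional_last)   # [{'x': 1, 'y': 10, 'z': 'c'}]
print(optional_early)  # []
```

With {{OPTIONAL}} last, the optional part fails to match and the row is kept.  With {{OPTIONAL}} early, it binds {{?z}} first and the later join then discards the row.  An optimiser that reordered one form into the other would change the results here.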

Optimisers are built for general cases, and in ARQ's case we tend to favour 
correctness over optimisation (there have been quite a few bugs around 
{{OPTIONAL}} in the past).  Yes, the optimiser could always be smarter, but the 
non-commutative, non-associative nature of {{OPTIONAL}} makes this difficult.

Semantically, replacing {{?source}} with a {{<uri>}} makes no difference.  From 
a runtime optimisation perspective, however, the number of ground terms in a 
triple pattern affects the order in which ARQ chooses to evaluate the triple 
patterns within a larger BGP, which, depending on the structure of your data, 
can improve performance.
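
As a rough illustration of why ground terms matter, the following Python sketch orders triple patterns by how many ground terms they contain.  This is a toy heuristic and not ARQ's actual reordering logic; the function names and the {{<http://example.org/source>}} IRI are made up for the example.

```python
def is_variable(term):
    """Treat any term starting with '?' as a SPARQL variable."""
    return term.startswith("?")


def ground_terms(pattern):
    """Count the non-variable terms in a (subject, predicate, object) pattern."""
    return sum(1 for term in pattern if not is_variable(term))


def reorder(patterns):
    """Toy heuristic: evaluate patterns with more ground terms first.

    The sort is stable, so patterns with equal counts keep their written order.
    """
    return sorted(patterns, key=ground_terms, reverse=True)


# Part of the BGP with ?source left as a variable.
with_variable = [
    ("?annotation", "oa:hasTarget", "?target"),
    ("?target", "oa:hasSource", "?source"),
    ("?target", "?e", "?f"),
]

# The same patterns with ?source replaced by a (hypothetical) IRI.
with_iri = [
    ("?annotation", "oa:hasTarget", "?target"),
    ("?target", "oa:hasSource", "<http://example.org/source>"),
    ("?target", "?e", "?f"),
]

for pattern in reorder(with_variable):
    print(pattern)
print()
for pattern in reorder(with_iri):
    print(pattern)
```

Under this heuristic the {{oa:hasSource}} pattern moves to the front once its object is an IRI, because it now has two ground terms and is likely to be more selective.  A real optimiser weighs more than a simple count, so treat this only as intuition.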



> Query execution speed depends more on WHERE clause order than expected
> ----------------------------------------------------------------------
>
>                 Key: JENA-1926
>                 URL: https://issues.apache.org/jira/browse/JENA-1926
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Fuseki, TDB2
>    Affects Versions: Jena 3.15.0
>            Reporter: Julian Gonggrijp
>            Priority: Minor
>
> The following query takes about 6.5 seconds with my dataset, which 
> unfortunately I cannot share. Note that {{?source}} is bound to a single IRI 
> in all queries below; I'm leaving that out for brevity.
> {code:java}
> PREFIX oa: <http://www.w3.org/ns/oa#>
> PREFIX dcterms: <http://purl.org/dc/terms/>
> CONSTRUCT {
>     ?annotation ?a ?b.
>     ?body ?c ?d.
>     ?target ?e ?f.
>     ?selector ?g ?h.
> } WHERE {
>     ?annotation oa:hasBody ?body;
>                 oa:hasTarget ?target;
>                 dcterms:creator ?user;
>                 ?a ?b.
>     ?target oa:hasSource ?source;
>             oa:hasSelector ?selector;
>             ?e ?f.
>     ?selector ?g ?h.
>     OPTIONAL { ?body ?c ?d }.
> }
> {code}
> Compare this to the following query, which I believe is exactly equivalent 
> but takes only 2 seconds:
> {code:java}
> CONSTRUCT {
>     ?annotation ?a ?b.
>     ?body ?c ?d.
>     ?target ?e ?f.
>     ?selector ?g ?h.
> } WHERE {
>     ?annotation oa:hasBody ?body.
>     OPTIONAL { ?body ?c ?d }.
>     ?annotation oa:hasTarget ?target;
>                 dcterms:creator ?user;
>                 ?a ?b.
>     ?target oa:hasSource ?source;
>             oa:hasSelector ?selector;
>             ?e ?f.
>     ?selector ?g ?h.
> }
> {code}
>  For comparison, leaving out the optional {{?body}} entirely, I get a query 
> that executes in 1.7 seconds:
> {code:java}
> CONSTRUCT {
>     ?annotation ?a ?b.
>     ?target ?e ?f.
>     ?selector ?g ?h.
> } WHERE {
>     ?annotation oa:hasTarget ?target;
>                 dcterms:creator ?user;
>                 ?a ?b.
>     ?target oa:hasSource ?source;
>             oa:hasSelector ?selector;
>             ?e ?f.
>     ?selector ?g ?h.
> }
> {code}
> I'm a novice to SPARQL, but coming from SQL, I wouldn't expect query 
> execution speed to depend so much on the order in which the criteria are 
> given.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
