Rob Vesse created JENA-441:
------------------------------

             Summary: In some cases it may be useful to apply DISTINCT before 
applying ORDER BY
                 Key: JENA-441
                 URL: https://issues.apache.org/jira/browse/JENA-441
             Project: Apache Jena
          Issue Type: Improvement
          Components: ARQ, Optimizer
    Affects Versions: Jena 2.10.0
            Reporter: Rob Vesse
            Assignee: Rob Vesse
            Priority: Minor


One of our internal users highlighted an interesting query where changing the 
plan makes a big difference in performance.

The query is essentially the following:

SELECT DISTINCT ?p
WHERE 
{
  ?s ?p ?o
} ORDER BY ?p

Leaving the fact that it is a fundamentally dumb query to write the user had an 
interesting suggestion about the query plan, currently this generates the 
following:

(distinct
    (project (?predicate)
      (order (?predicate)
        (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?s ?predicate ?o)))))

For cases like this it may actually be much more performant to do the distinct 
first, because of the associated semantics of the various operators you can't 
just simply put distinct before the order but if you rewrite the query as 
follows:

SELECT ?p
WHERE
{
  { SELECT DISTINCT ?p WHERE { ?s ?p ?o } }
} ORDER BY ?p

You get the likely much more performant plan:

(project (?predicate)
    (order (?predicate)
      (distinct
        (project (?predicate)
          (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?/s ?predicate 
?/o))))))

Clearly this optimization does not apply in the general case, I think it only 
applies in the case where you have a DISTINCT and all ORDER BY conditions are 
simple variables and only those variables are projected in which case I think 
you could produce a plan like the following:

(order (?predicate)
    (project (?predicate)
      (distinct
        (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?s ?predicate ?o)))))

I am pretty certain this applies in only a few cases and I haven't reproduced 
this with TDB to see if the performance difference is noticeable yet.

Andy - I will try and gather some more information and experiment with this so 
don't feel you have to look at this one for the time being.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to