[jira] [Updated] (JENA-709) Index join strategy may need to be more conservative when some sequence elements are potentially expensive

Rob Vesse (JIRA) Wed, 04 Jun 2014 02:07:27 -0700

     [ 
https://issues.apache.org/jira/browse/JENA-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rob Vesse updated JENA-709:
---------------------------

    Description: 
As noted in a mailing list thread discussing a poorly performing query 
(http://s.apache.org/cAn), there are cases where introducing {{sequence}} can 
actually make a query slower, namely when some elements of the {{sequence}} 
are expensive to calculate, e.g. sub-queries.

The example query given is:

{noformat}
SELECT DISTINCT ?O ?T  ?E
WHERE
{  
  ?E a x:E. 
  {
    SELECT ?O ?T 
    WHERE 
    {
      ?O :oE ?E ;
            :oT ?T .
    } 
    ORDER BY DESC(?T)
    LIMIT 3
  }
}
{noformat}

Which produces the following algebra:

{noformat}
(distinct
 (project (?O ?T ?E)
  (sequence
   (bgp (triple ?E rdf:type x:E))
   (project (?O ?T)
    (top (3 (desc ?T))
     (bgp
      (triple ?O :oE ?/E)
      (triple ?O :oT ?T)
     ))))))
{noformat}

Because scoping leaves no common variables (the inner {{?E}} is renamed to 
{{?/E}} and is not projected), substituting the bindings from the first 
sequence element into the sub-query has no effect, so the expensive sub-query 
(note the {{top}} operator) gets executed in full for every single LHS solution.
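
To make the cost difference concrete, here is a toy model in Python. It is not 
Jena code: the triples, the function names and the call counter are all 
made up for illustration, and the "sub-query" is just a stand-in that ignores 
whatever binding it is handed, mirroring the fact that nothing from the LHS can 
be substituted into it.

```python
# Toy model only; every name below is hypothetical and none of it is ARQ API.
TRIPLES = [
    ("e1", "type", "E"), ("e2", "type", "E"), ("e3", "type", "E"),
    ("o1", "oE", "e1"), ("o1", "oT", 10),
    ("o2", "oE", "e2"), ("o2", "oT", 30),
    ("o3", "oE", "e3"), ("o3", "oT", 20),
    ("o4", "oE", "e1"), ("o4", "oT", 5),
]

def top3_subquery(triples, binding):
    """Stand-in for (project (?O ?T) (top (3 (desc ?T)) (bgp ...))).

    `binding` is accepted but never used: the inner ?E is scoped away,
    so an LHS binding cannot narrow the work done here.
    """
    rows = [
        {"O": s1, "T": t}
        for (s1, p1, _e) in triples if p1 == "oE"
        for (s2, p2, t) in triples if p2 == "oT" and s2 == s1
    ]
    rows.sort(key=lambda r: r["T"], reverse=True)
    return rows[:3]

def lhs_solutions(triples):
    """Stand-in for (bgp (triple ?E rdf:type x:E))."""
    return [{"E": s} for (s, p, o) in triples if p == "type" and o == "E"]

def sequence_strategy(lhs_rows, triples):
    """Re-run the right-hand side once per LHS solution."""
    calls, out = 0, []
    for left in lhs_rows:
        calls += 1
        for right in top3_subquery(triples, left):
            out.append({**left, **right})
    return out, calls

def join_strategy(lhs_rows, triples):
    """Evaluate the right-hand side once, then combine with the LHS."""
    right_rows = top3_subquery(triples, {})
    out = [{**left, **right} for left in lhs_rows for right in right_rows]
    return out, 1

LHS = lhs_solutions(TRIPLES)
SEQ_ROWS, SEQ_CALLS = sequence_strategy(LHS, TRIPLES)
JOIN_ROWS, JOIN_CALLS = join_strategy(LHS, TRIPLES)
```

In this model both strategies return the same rows, but the sequence-style loop 
runs the sort-and-limit once per LHS solution while the join-style version runs 
it once. How large that gap is in practice depends on the dataset, which we do 
not have.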

It is unclear from the discussion thread so far whether this is just a badly 
written query, and we don't have an example dataset that demonstrates the 
performance problems. However, just looking at the algebra, it seems we would 
be better off avoiding {{sequence}} in favour of a plain {{join}} in a case 
like this.
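
One possible shape for such a check, sketched in Python over a nested-tuple 
stand-in for the algebra. This is an assumption about what "more conservative" 
could mean, not how ARQ's optimizer works: the set of "expensive" operators, the 
tuple encoding and the function names are all invented for the sketch.

```python
# Hypothetical sketch; not ARQ's optimizer and not its algebra classes.
# Operators are tuples: ("bgp", [triples]), ("project", [vars], sub), ("top", sub).
EXPENSIVE_OPS = {"top", "order", "slice", "group"}  # assumed, not exhaustive

def visible_vars(op):
    """Variables an operator exposes to its parent."""
    kind = op[0]
    if kind == "bgp":
        return {term for triple in op[1] for term in triple
                if term.startswith("?")}
    if kind == "project":
        return set(op[1])  # projection hides everything else, e.g. ?/E
    return visible_vars(op[-1])  # other unary operators pass variables through

def contains_expensive(op):
    """True if the operator tree contains an operator from EXPENSIVE_OPS."""
    kind = op[0]
    if kind in EXPENSIVE_OPS:
        return True
    if kind == "bgp":
        return False
    return contains_expensive(op[-1])

def choose_join_op(lhs, rhs):
    """Prefer a plain join when substitution cannot help an expensive RHS."""
    shared = visible_vars(lhs) & visible_vars(rhs)
    if contains_expensive(rhs) and not shared:
        return "join"
    return "sequence"  # stands for whatever the optimizer would otherwise pick

LHS_OP = ("bgp", [("?E", "rdf:type", "x:E")])
RHS_OP = ("project", ["?O", "?T"],
          ("top", ("bgp", [("?O", ":oE", "?/E"), ("?O", ":oT", "?T")])))
CHOICE = choose_join_op(LHS_OP, RHS_OP)
```

For the algebra above the sketch picks {{join}}, because the right-hand side 
contains {{top}} and exposes only {{?O}} and {{?T}}, neither of which appears 
on the left.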


> Index join strategy may need to be more conservative when some sequence 
> elements are potentially expensive
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: JENA-709
>                 URL: https://issues.apache.org/jira/browse/JENA-709
>             Project: Apache Jena
>          Issue Type: Brainstorming
>          Components: ARQ, Optimizer
>            Reporter: Rob Vesse



--
This message was sent by Atlassian JIRA
(v6.2#6252)
