Alex and Paul correctly point out that they are in fact different queries.
> Looking at the code for Optional, I find myself in QueryIterLeftJoin,
> which is mostly implemented in QueryIterJoinBase. (Did I go to the
> right place?) From the looks of things in here, the first thing that
> code does is to create a table from its RHS argument.
>
> Is there any case where one of these operations evaluates itself WRT
> the RHS lazily?
I just wanted to add there is an optimization here.
If you run "qparse --print-opt" it will show the post-optimizer algebra.
Claude shows the direct translation of SPARQL to the algebra
(--print=op). The next step is various high level (algebra->algebra)
optimizations after which the algebra is:
(conditional
(conditional
(bgp (triple ?x rdf:type <foo>))
(bgp (triple ?x <bar> ?bar)))
(bgp (triple ?y <baz> ?baz)))
(conditional) is a special form of (leftjoin), like (sequence) is to
(join). You can stream the results from the left into the right without
worrying about variable scope issues and it means you can do what you
might call an index join in constant space.
Example:
{
?s :ifp 123 . ## Finds <s>
OPTIONAL { ?s :p ?v }
}
will make one call for <s> :p ?v -- i.e. ?s grounded to <s>.
This transformation is the normal case. Only weird things like doubly
nested optionals with a variable used in the outer and inner scopes, but
not the middle scope, or deep filters in the RHS which use an optional
variable which is also on the LHS block this transformation. In those
cases QueryIterLeftJoin is used - it's not very clever.
The normal path is QueryIterOptionalIndex which is streaming and
constant memory overhead.
Andy
On 03/07/12 17:11, Paul Gearon wrote:
On Tue, Jul 3, 2012 at 10:56 AM, Claude Warren <[email protected]> wrote:
I was looking at a couple of queries and found that
SELECT * WHERE {
?x a <foo> .
OPTIONAL{ ?x <bar> ?bar } .
OPTIONAL{ ?y <baz> ?baz }
}
becomes
(leftjoin
(leftjoin
(bgp (triple ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<file:///home/claude/workspace/RoadMap/foo>))
(bgp (triple ?x <file:///home/claude/workspace/RoadMap/bar> ?bar)))
(bgp (triple ?x <file:///home/claude/workspace/RoadMap/baz> ?baz)))
BUT ....
SELECT * WHERE {
?x a <foo> .
OPTIONAL{ ?x <bar> ?bar .
?y <baz> ?baz }
}
becomes
leftjoin
(bgp (triple ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<file:///home/claude/workspace/RoadMap/foo>))
(bgp
(triple ?x <file:///home/claude/workspace/RoadMap/bar> ?bar)
(triple ?x <file:///home/claude/workspace/RoadMap/baz> ?baz)
))
It seems to me that the 2nd form is much more efficient. (Am I wrong here?)
Doesn't the second form create an outer product on the RHS? Isn't that
going to be less efficient?
The outer product could be computer lazily, but then isn't the effect
going to be more like the first form?
Looking at the code for Optional, I find myself in QueryIterLeftJoin,
which is mostly implemented in QueryIterJoinBase. (Did I go to the
right place?) From the looks of things in here, the first thing that
code does is to create a table from its RHS argument.
Is there any case where one of these operations evaluates itself WRT
the RHS lazily?
That leads me to ask "Is there ever a time when 2 adjacent option
statements can not be merged into a single statement?"
I have work to do, so I'm not trying to think of every case here. I'm
just looking at what you've done where the first OPTIONAL shares a
variable with the LHS, and the second OPTIONAL shares none.
In that case, the merged form is nearly equivalent, but not quite.
If the first optional term:
OPTIONAL{ ?x <bar> ?bar }
results in no bindings and the second optional term:
OPTIONAL{ ?y <baz> ?baz }
has bindings, then the final result will contain the bindings from the
second optional term.
However, the outer product of the first term and the second will be
empty, so the merged form of:
OPTIONAL{ ?x <bar> ?bar . ?y <baz> ?baz}
will be empty. This means that the bindings for ?y and ?baz will NOT
appear in the merged form if there are no bindings for ?z and ?bar.
Ergo, they are not equivalent.
If you want to see, I just checked it against the following data:
@prefix ex: <http://example.com/ex#> .
ex:Alice ex:childOf ex:Bruce .
ex:Chuck ex:brotherOf ex:Bruce .
prefix ex: <http://example.com/ex#>
select * {
?x ex:brotherOf ?y
OPTIONAL {?x ex:foo ?z . ?s ex:childOf ?o}
}
result:
{ x=<http://example.com/ex#Chuck> y=<http://example.com/ex#Bruce> }
prefix ex: <http://example.com/ex#>
select * {
?x ex:brotherOf ?y
OPTIONAL {?x ex:foo ?z }
OPTIONAL {?s ex:childOf ?o}
}
result:
{ x=<http://example.com/ex#Chuck> y=<http://example.com/ex#Bruce>
s=<http://example.com/ex#Alice> o=<http://example.com/ex#Bruce> }
And if not, should the parser be optimized to do the merging?
The two could possibly be handled with the same code, so long as there
were flags to indicate the behavior to adopt should an edge case be
encountered (such as one of the terms being empty). Typically, you
want a compelling use case to add this kind of complexity. In this
case, that would mean two forms of the query with the same output on a
given data set operating at significantly different speeds.
Given that the current code materializes the RHS, then it often be
better to opt for the first form. A dynamic optimizer could make that
determination based on the relative sizes of the binding sets
resulting from each form (if each optional resulted in a large set of
bindings, then the outer product would be *very* large).
Regards,
Paul