Re: Initial Bindings in Query Evaluation

Rob Vesse Fri, 02 Aug 2013 16:51:56 -0700

Holger

Comments inline:

On 8/2/13 3:50 PM, "Holger Knublauch" <[email protected]> wrote:

>Folks,
>
>all I did was report a bug, and your response is to delete the whole
>feature! This would completely throw out the baby with the bathwater and
>may mean that we at TopQuadrant have to branch off to our own Jena
>version when this happens.

I started the discussion because what your bug highlights is that initial
bindings, as they are currently implemented, are broken.  We appear to
disagree on how they are broken, but clearly they are not entirely correct
in their behavior one way or the other.

I'm aiming to solicit opinions from all interested parties and try to get
a consensus about how this feature should be handled in the future,
whether that means merely fixing the current implementation, replacing it
with another solution or something else entirely.  Note that I explicitly
said that this is about re-architecting this feature in some future
release; no-one is going into the code base and deleting this feature
immediately.  If the eventual decision is to remove this feature in its
current form then there would be a deprecation cycle.

Forcing TQ to fork Jena is not my intention, I'm trying to find an
appropriate means to address something which I consider to be broken in a
way that benefits the whole community.  Note that there are other places
where initial bindings are completely unimplemented/broken depending on
your POV e.g. remote queries.

>
>I would be in favor of this option
>
>5) Keep the API as it is and restore initial bindings for UPDATE as well.

Side Note - Initial bindings for updates were removed because they were a
barrier to streaming updates
(http://markmail.org/message/bazwh2exmcc5vmoh).  Also, as others noted in
the discussion there, initial bindings are a little murkier for updates:
do they apply only to WHERE clauses, to all portions of requests, etc?

Keeping the API as-is is always an option, if this ends up being the
preference of the community then we definitely need to improve the
documentation to note that there can be unintended interactions with other
parts of the query engine such as the optimizer when initial bindings are
used.

>
>It was IMHO a mistake to switch to parameterized queries. We make very
>heavy use of initial bindings throughout our software stack, SPIN, SWP,
>SPARQLMotion. They are a great feature, allowing users to treat SPARQL
>like a programming language (think about a function call, where the
>parameters are different each time, yet the developer only needs to
>write the logic once).

What you are talking about sounds much more like cached execution plans in
SQL.  I understand the analogy of a SPARQL query to a function but SPARQL
variables were not intended to be function arguments, that you choose to
treat them as such and that initial bindings lets you treat them as such
is a perhaps unintended consequence of ARQ's API.

>
> From what I understand so far, parameterized SPARQL has performance
>issues - we would need to re-parse the string while with initial
>bindings we can reuse the same (compiled) Query object each time.
>Parameterized queries also don't support the bound(?var) operator, and
>probably others.

Yes there is a performance hit associated with parameterized queries and
yes it is possible to turn a syntactically valid query into an invalid one
by injecting certain values because of the way the parameterization is
done on a purely textual basis.
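
To make the textual-substitution hazard concrete, here is a minimal
sketch.  It is a toy stand-in, not the code of
ParameterizedSparqlString: the class name and the substitute() helper
are hypothetical, and the replacement is deliberately naive.

```java
import java.util.Map;

class TextualSubstitutionDemo {
    // Hypothetical helper: replaces every "?name" with the supplied text.
    // It performs no syntax checking, which is the point of the example.
    static String substitute(String query, Map<String, String> values) {
        String result = query;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace("?" + e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String query = "ASK { ?s ?p ?o . FILTER(bound(?o)) }";
        String injected = substitute(query, Map.of("o", "<http://example.org/x>"));
        // bound() requires a variable argument, so the result no longer parses:
        // ASK { ?s ?p <http://example.org/x> . FILTER(bound(<http://example.org/x>)) }
        System.out.println(injected);
    }
}
```

The input query is valid SPARQL, but after substitution bound() is
applied to an IRI rather than a variable, which the grammar rejects.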

This is why I suggested Options 2 and 3, which are algebraic manipulations
that would allow the initial bindings to be injected by transformation of
the unoptimized algebra tree.  This lets you take the parsing hit only
once and use initial bindings as you do now, without the optimizer making
incorrect decisions because initial bindings supply information that it
doesn't have access to.  These options require substantially more work to
get all the kinks worked out but are undoubtedly doable.  If people agree
this is the way forward then likely this will happen eventually, but
contributions to make it happen sooner are always welcome.  This would
give users something much closer to cached execution plans, which from
what you described is essentially how TQ is using initial bindings.

Also note that as a user you always have the option of using the BIND
and/or VALUES clauses in your queries and updates which achieves the same
ends as the initial bindings API and has the benefit of being semantically
much clearer from the POV of both the query writer and query engine.
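
As a rough illustration of the VALUES route, the sketch below renders a
single row of bindings as a VALUES clause and places it in the query
text.  The class and the valuesClause() helper are hypothetical, and the
sketch assumes each value is already a syntactically valid SPARQL term.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class ValuesClauseDemo {
    // Hypothetical helper: renders one row of bindings as a VALUES clause.
    // Assumes every value is already a valid SPARQL term (IRI, literal, ...).
    static String valuesClause(Map<String, String> bindings) {
        StringJoiner vars = new StringJoiner(" ", "( ", " )");
        StringJoiner row = new StringJoiner(" ", "( ", " )");
        for (Map.Entry<String, String> e : bindings.entrySet()) {
            vars.add("?" + e.getKey());
            row.add(e.getValue());
        }
        return "VALUES " + vars + " { " + row + " }";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the variables in insertion order.
        Map<String, String> bindings = new LinkedHashMap<>();
        bindings.put("a", "true");
        bindings.put("b", "true");
        String query = "ASK { " + valuesClause(bindings) + " FILTER(?a = ?b) }";
        System.out.println(query);
    }
}
```

Because the bindings are part of the query itself, the parser and the
optimizer both see them, so nothing is hidden from the engine.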

>
>We have used initial bindings successfully for many years, and although
>there have been occasional bugs (I probably reported one such bug per
>year) it is a very essential feature from our point of view. If the
>overhead of fixing the optimizer is too big, I would be OK with
>switching off this optimizer if initial bindings are present. It worked
>fine without this optimizer for many years.

I understand that TQ considers the feature essential and I hope we as a
community can come to some agreement on this topic that satisfies the
whole community.

The optimizer is not broken; what the optimizer is doing is entirely
correct wrt the semantics of the SPARQL query it is given and the
information it has.  As I pointed out in my original reply to you, there
are other optimizations that have had the same behavior for years.  That
none of your unit tests previously covered a case which would be affected
in this way, whereas the new optimization does happen to cover such a
case, is a lucky coincidence, as otherwise neither you nor we would have
been any the wiser.

The example query semantically should always return false, that the
initial bindings API allows a user to make it do otherwise is the bug IMO
though I think it is clear we disagree on this point.
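
The disagreement can be modelled with a toy evaluator.  This is a
simplified sketch, not ARQ's evaluation code: the names are hypothetical,
terms are compared as plain strings, and an unbound variable is treated
as an expression error, which makes the FILTER reject the solution.

```java
import java.util.Map;

class FilterEqualityDemo {
    // Toy model of ASK { FILTER(?x = ?y) } evaluated against one solution.
    // An unbound variable stands for an error, so the filter rejects the row.
    static boolean askFilterEquals(Map<String, String> solution, String x, String y) {
        String vx = solution.get(x);
        String vy = solution.get(y);
        if (vx == null || vy == null) {
            return false; // error in the filter expression: no solution survives
        }
        return vx.equals(vy); // simplification: string comparison of terms
    }

    public static void main(String[] args) {
        // No initial bindings: the empty solution binds neither variable.
        System.out.println(askFilterEquals(Map.of(), "a", "b"));
        // With initial bindings for both variables the same query succeeds.
        System.out.println(askFilterEquals(Map.of("a", "true", "b", "true"), "a", "b"));
    }
}
```

Read on its own, the query only ever sees the empty solution, so the
first call is the only possible outcome; the second call shows how
pre-seeding the solution changes the answer.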

Individual optimizations can always be turned off, for this specific one:

ARQ.getContext().set(ARQ.optFilterImplicitJoin, false);

If FILTER(?x = ?y) and FILTER(?x = <uri>) are common patterns that TQ and
SPIN users make use of in conjunction with initial bindings then you
should perhaps turn both this and ARQ.optFilterEquality off.  Though if
the problem is only in strange corner cases like the example you gave then
you may be perfectly fine since the special case causing you problems only
occurs when the variables are not used within the inner operator that the
FILTER applies over.

Switching the whole optimizer off is overkill and will almost certainly
harm performance because that will include disabling critical
optimizations such as ARQ's index join strategy.

Rob

>
>HTH
>Holger
>
>
>On 8/3/2013 2:52, Rob Vesse wrote:
>> Hi All
>>
>> Holger's question
>>(http://mail-archives.apache.org/mod_mbox/jena-users/201308.mbox/%3c51FB8
>>[email protected]%3e) about a regression in ARQs treatment of
>>initial bindings raises an interesting disconnect between the
>>interpretation of SPARQL and the Initial Bindings API.
>>
>> Initial bindings in their current form allow users to essentially
>>change the semantics of a query in a non-intuitive way.  Take his
>>example query:
>>
>> ASK { FILTER(?a = ?b) }
>>
>> Intuitively that query MUST always return false yet with initial
>>bindings in the mix the query can be made to return true, at least prior
>>to 2.10.2 which introduces a new optimizer which includes special case
>>recognition for this.
>>
>> The problem is that using initial bindings can fundamentally change the
>>semantics of queries in non-intuitive ways when I believe the intention
>>of the API was merely to allow for improved performance by guiding the
>>engine.
>>
>> To me this suggests that initial bindings as currently implemented is
>>fundamentally flawed and I would suggest that we think about
>>re-architecting this feature in a future release (not the next release).
>> I believe there are probably several ways of doing this:
>>
>> 1  Remove support for initial bindings on queries entirely (as we
>>already did for updates) in favor of using ParameterizedSparqlString
>>
>> 2  Change initial bindings to be a pre-optimization algebra
>>transformation of the query
>>
>> As we've discussed previously in the context of
>>ParameterizedSparqlString there is potential to do the substitution at
>>the algebra tree level rather than at the textual level.  This allows
>>for stronger syntax checking and actually changes the query
>>appropriately.  The problem with this is that it doesn't work if we want
>>to inject multiple values for a variable, hence Option 3
>>
>> 3  Change initial bindings to be done by injection of VALUES clauses
>>
>> This approach is again by algebra transform and would involve inserting
>>VALUES clauses at each leaf of the algebra tree.  So Holger's query with
>>initial bindings applied would be rewritten like so:
>>
>> ASK
>> {
>>    VALUES ( ?a ?b ) { ( true true ) }
>>    FILTER (?a = ?b)
>> }
>>
>> However this approach might get rather complex for larger queries and
>>also runs into issues of scope, what if we insert the VALUES clause
>>inside of a sub-query which doesn't propagate those initial bindings
>>outside of it etc.
>>
>> 4  Skip optimization when initial bindings are involved
>>
>> This is the easiest approach but we can't enforce this on other query
>>engine implementations and it could seriously harm performance for those
>>that use initial bindings extensively.
>>
>> There may also be other approaches I haven't thought of, so please
>>suggest anything that makes sense.  Bottom line is that initial bindings
>>in its current form seems fundamentally broken to me and we should be
>>thinking of how to fix this in the future.
>>
>> Rob
>>
>
