On 07/12/12 14:28, Laurent Pellegrino wrote:
Hello all,
I wonder whether there exists a page that summarizes briefly how the SPARQL
Operations and Functions are handled by TDB. The idea is to know what are
the functions or operations that use B+-tree indexes (or more generally a
specific datastructure, a property, etc.) to be resolved "efficiently" and
what are those that should be used with care on a big dataset without a
previous BGP filtering (i.e. there is "no other way" to improve it or this
is not yet implemented, etc.).
If there is no documentation about that, does someone know how the
following operators are handled internally by Jena TDB if we suppose a BGP
that filters nothing and a FILTER with one of the following functions or
operators (it should be the worst case):
- datatype(?x) = xsd:integer, is there a kind of index for each datatype
associated to a quad/triple such that when this condition appears it can be
checked "efficiently" without comparing the values or NodeIds against all
the values returned, for example, after a simple BGP? Is a datatype URI stored
inside the NodeTable or another table?
- STRSTARTS(STR(?x), "coucou")
- The simple "=" operator
- sameTerm
Kind Regards,
Laurent
There isn't a strong connection between the functions and operators and
the TDB index design, but the high-level optimizer does perform some
optimizations that work well with the indexes by making patterns as
grounded as possible.
Equality rewrite is done in ARQ; filter placement, for TDB, is done in
TDB (it can also happen in ARQ - there's a tension between whether to
optimize the BGP then place filters, or place filters then optimize).
The equality rewrite isn't very aggressive on literals because BGP
matching is by term, whereas FILTER is by value. +0123 and 123 are the
same value but different literals. I'm not convinced this is a good
idea and maybe it ought to change: data loading would canonicalize
literals.
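To make the term/value distinction concrete, here is a minimal sketch in Python. It is not ARQ or TDB code: the Literal type and the same_term, value_equal and canonicalize functions are hypothetical names for illustration, and only xsd:integer is modelled.

```python
from typing import NamedTuple

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal (not a Jena class)."""
    lexical: str   # the characters as written in the data
    datatype: str  # datatype IRI


def same_term(a: Literal, b: Literal) -> bool:
    # Term comparison, as in BGP matching: lexical form and datatype
    # must both be identical.
    return a == b


def value_equal(a: Literal, b: Literal) -> bool:
    # Value comparison, as in FILTER "="; modelled for xsd:integer only.
    if a.datatype == XSD_INTEGER and b.datatype == XSD_INTEGER:
        return int(a.lexical) == int(b.lexical)
    return same_term(a, b)


def canonicalize(lit: Literal) -> Literal:
    # Assumption: what canonicalizing at load time could look like
    # for integers. Other datatypes are left untouched.
    if lit.datatype == XSD_INTEGER:
        return Literal(str(int(lit.lexical)), lit.datatype)
    return lit


a = Literal("+0123", XSD_INTEGER)
b = Literal("123", XSD_INTEGER)
print(same_term(a, b), value_equal(a, b), same_term(canonicalize(a), b))
```

If literals were canonicalized when loaded, the two comparisons would agree for such data, which is what would make a more aggressive rewrite safe.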
The other change is an RDF 1.1 thing. Simple literals go away and there
are only xsd:strings, so the "=" then works on lang-tag-less strings like
BGP matching does.
Examples:
URIs are always safe to transform:
{ ?s ?p ?o . FILTER ( ?o = <uri> ) }
=>
{ ?s ?p <uri> . BIND(<uri> AS ?o) }
sameTerm on numbers:
{ ?s ?p ?o . FILTER ( ?o = 123 ) }
is not safe to transform (?o = 00123 is a match) but
{ ?s ?p ?o . FILTER ( sameTerm(?o, 123) ) }
is safe.
sameTerm(?o,"abc"@en) is optimized
(?o = "abc"@en) isn't optimized - you can call that a bug-of-omission.
I don't see why it can't do it - it seems to treat it like
(?o = "abc") which has two pattern matches. Hmm - thinking about it, it
could treat that as a disjunction of sameTerm. Doable now ... a bit of
a "doh" moment there.
?o = "abc"
==>
sameTerm(?o, "abc") || sameTerm(?o, "abc"^^xsd:string)
The equality optimization works with IN as well. The query is expanded
for the disjunction then equality rewrite happens.
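As a rough illustration of that two-step process (expand IN to a disjunction, then rewrite each equality), here is a small Python sketch. It is not how ARQ represents algebra expressions: the tuple encoding of terms and the function names expand_in, same_term_alternatives and rewrite are assumptions made for the example. Only URIs and plain strings are rewritten; everything else is treated as unsafe.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# Hypothetical term encoding:
#   ("uri", iri)
#   ("literal", lexical, datatype_or_None, lang_or_None)


def expand_in(var, terms):
    # ?x IN (a, b) is treated as (?x = a) || (?x = b).
    return [("=", var, t) for t in terms]


def same_term_alternatives(term):
    """Terms whose sameTerm disjunction stands in for '= term',
    or None when this sketch considers the rewrite unsafe."""
    if term[0] == "uri":
        return [term]  # URIs are always safe
    if term[0] == "literal":
        _, lexical, datatype, lang = term
        if datatype is None and lang is None:
            # ?o = "abc"  ==>  sameTerm(?o, "abc")
            #              ||  sameTerm(?o, "abc"^^xsd:string)
            return [term, ("literal", lexical, XSD_STRING, None)]
    return None  # e.g. numbers: 123 and 00123 are equal by value


def rewrite(var, terms):
    """Rewrite '?var IN (terms)' to a list of sameTerm alternatives,
    or return None if any member blocks the rewrite."""
    out = []
    for _, v, t in expand_in(var, terms):
        alts = same_term_alternatives(t)
        if alts is None:
            return None
        out.extend(("sameTerm", v, a) for a in alts)
    return out


uri = ("uri", "http://example/a")        # example IRI, illustrative only
plain = ("literal", "abc", None, None)
for step in rewrite("?o", [uri, plain]):
    print(step)
```

Each sameTerm alternative can then be turned into a grounded pattern, which is where the indexes come in.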
That's sameTerm and = covered - it's mainly a high level (algebra to
algebra) transformation.
datatype(?x) = xsd:integer
There is no optimization for this. SDB stores the data in a form where
it could be done, but it doesn't. TDB does not store the datatype
separately - it could, but it would be a different table layout.
Is it useful?
STRSTARTS(STR(?x), "coucou")
No specific optimization for this. There isn't a prefix index. Again,
something that can be added, it just hasn't been. You can use LARQ for
similar effects. It would be nice though to have an integrated prefix
index, and even some regex acceleration (c.f. SQL's LIKE).
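For a sense of what a prefix index would buy, here is a hypothetical Python sketch. Nothing like this exists in TDB as described above; the sorted list stands in for an ordered index over lexical forms, and prefix_range is an invented name.

```python
from bisect import bisect_left


def prefix_range(sorted_strings, prefix):
    """Return all entries starting with prefix.

    In a sorted sequence, the entries sharing a prefix are contiguous,
    so one binary search finds the start and a short scan finds the end,
    instead of testing every string in the dataset."""
    start = bisect_left(sorted_strings, prefix)
    end = start
    while end < len(sorted_strings) and sorted_strings[end].startswith(prefix):
        end += 1
    return sorted_strings[start:end]


index = sorted(["coucou", "coucou2", "couch", "hello", "cou"])
print(prefix_range(index, "coucou"))
```

Without such an index, STRSTARTS has to be evaluated against every binding the pattern produces.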
Many optimizations can be done that aren't. There is a slight issue
that too much optimization means that simple queries slow down as more
time is spent optimizing than just simply doing the query. BSBM shows this.
Also, it is useful to note that in TDB, certain datatypes are stored
"inline" -- that is, the value is stored in the index itself, using 56
bits of the 64-bit NodeId for the encoded value. That means converting
an object in a triple to its value for testing in ARQ is quite
cheap, e.g. testing whether a value is in a range is quite cheap (no node
table access). Parts of the BSBM benchmark show this up in quite extreme ways.
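A minimal Python sketch of the inlining idea follows. Only the 56-of-64-bits split comes from the description above; putting a type tag in the top 8 bits, the tag value TAG_INTEGER, and the two's-complement layout are assumptions for illustration and not TDB's actual encoding.

```python
VALUE_BITS = 56
VALUE_MASK = (1 << VALUE_BITS) - 1
TAG_INTEGER = 0x01  # hypothetical tag value, not TDB's real one


def encode_inline_int(n):
    """Pack an integer into a 64-bit id, or return None if it does not
    fit in 56 bits (such a value would go through the node table)."""
    lo = -(1 << (VALUE_BITS - 1))
    hi = (1 << (VALUE_BITS - 1)) - 1
    if not lo <= n <= hi:
        return None
    # Assumed layout: tag in the top 8 bits, two's-complement value below.
    return (TAG_INTEGER << VALUE_BITS) | (n & VALUE_MASK)


def decode_inline_int(node_id):
    """Recover the integer from the id alone; None if not an inline integer."""
    if node_id >> VALUE_BITS != TAG_INTEGER:
        return None
    raw = node_id & VALUE_MASK
    if raw >= 1 << (VALUE_BITS - 1):  # sign bit set: negative value
        raw -= 1 << VALUE_BITS
    return raw


def in_range(node_id, low, high):
    # A range test that needs only the id, with no node table lookup.
    value = decode_inline_int(node_id)
    return value is not None and low <= value <= high


nid = encode_inline_int(50)
print(hex(nid), decode_inline_int(nid), in_range(nid, 10, 100))
```

The point is that the filter can be evaluated from the index entry itself, which is why range tests on inlined values are cheap.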
Filter placement also happens: with a long BGP, the filter can be
placed just after the point where all the necessary variables are
defined, before further expansion of possibilities in later parts of the
pattern.
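The placement rule can be sketched in a few lines of Python. This is an illustration under simple assumptions, not the TDB or ARQ implementation: patterns are tuples of strings, variables start with "?", each filter is given with the set of variables it needs, and place_filters is an invented name.

```python
def place_filters(patterns, filters):
    """Interleave filters with triple patterns.

    patterns: list of triple patterns, each a tuple of strings.
    filters:  list of (expression_text, set_of_variables_needed).
    Each filter is emitted right after the first pattern at which all
    of its variables are bound."""
    plan = []
    bound = set()
    pending = list(filters)
    for pattern in patterns:
        plan.append(("pattern", pattern))
        bound |= {t for t in pattern if t.startswith("?")}
        still_pending = []
        for expr, needed in pending:
            if needed <= bound:
                plan.append(("filter", expr))
            else:
                still_pending.append((expr, needed))
        pending = still_pending
    # Filters whose variables are never all bound stay at the end.
    plan.extend(("filter", expr) for expr, _ in pending)
    return plan


bgp = [("?s", ":price", "?p"), ("?s", ":label", "?l"), ("?s", ":maker", "?m")]
for step in place_filters(bgp, [("?p < 100", {"?p"})]):
    print(step)
```

Here the price filter runs straight after the first pattern, so the later patterns are only expanded for rows that already passed it.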
Hope that helps - is it what you are looking for?
Andy