Re: SPARQL Operators and Functions with TDB

Andy Seaborne Sat, 08 Dec 2012 09:52:47 -0800

On 07/12/12 14:28, Laurent Pellegrino wrote:

Hello all,


I wonder whether there exists a page that summarizes briefly how the SPARQL
Operations and Functions are handled by TDB. The idea is to know what are
the functions or operations that use B+-tree indexes (or more generally a
specific datastructure, a property, etc.) to be resolved "efficiently" and
which ones should be used with care on a big dataset without a
previous BGP filtering (i.e. there is "no other way" to improve it or this
is not yet implemented, etc.).

If there is no documentation about that, does someone know how the
following operators are handled internally by Jena TDB if we suppose a BGP
that filters nothing and a FILTER with one of the following functions or
operators (it should be the worst case):

- datatype(?x) = xsd:integer, is there a kind of index for each datatype
associated to a quad/triple such that when this condition appears it can be
checked "efficiently" without comparing the values or NodeIds against all
the values returned for example after a simple BGP? Is a datatype URI stored
inside the NodeTable or another table?

- STRSTARTS(STR(?x), "coucou")

- The simple "=" operator

- sameTerm

Kind Regards,

Laurent

There isn't a strong connection between the functions and operators and the TDB index design, but the high-level optimizer does perform some optimizations that work well with the indexes by making patterns as grounded as possible.

Equality rewrite is done in ARQ; filter placement, for TDB, is done in TDB (it can also happen in ARQ - there's a tension between whether to optimize the BGP then place filters, or place filters then optimize).

The equality filter isn't very aggressive on literals because BGP matching is by term, whereas FILTER is by value. +0123 and 123 are the same value but different literals. I'm not convinced this is a good idea and maybe it ought to change - data loading would canonicalize literals.
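To make the term/value distinction concrete, here is a minimal Python sketch. It is not ARQ or TDB code; the names Literal, same_term, value_equal and canonicalize are made up for illustration, and only xsd:integer is handled.

```python
from typing import NamedTuple

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


class Literal(NamedTuple):
    # Hypothetical stand-in for an RDF literal: lexical form plus datatype IRI.
    lexical: str
    datatype: str


def same_term(a: Literal, b: Literal) -> bool:
    """Term equality: lexical form and datatype must match exactly (BGP-style)."""
    return a == b


def value_equal(a: Literal, b: Literal) -> bool:
    """Value equality (FILTER '=' style), sketched for xsd:integer only."""
    if a.datatype == XSD_INTEGER and b.datatype == XSD_INTEGER:
        return int(a.lexical) == int(b.lexical)
    return same_term(a, b)


def canonicalize(lit: Literal) -> Literal:
    """Rewrite an xsd:integer to one canonical lexical form, as a loader might."""
    if lit.datatype == XSD_INTEGER:
        return Literal(str(int(lit.lexical)), lit.datatype)
    return lit


a = Literal("+0123", XSD_INTEGER)
b = Literal("123", XSD_INTEGER)
print(same_term(a, b))                              # different terms
print(value_equal(a, b))                            # same value
print(same_term(canonicalize(a), canonicalize(b)))  # same term once canonicalized
```

If literals were canonicalized at load time, as suggested above, term matching and value matching would agree for cases like this one.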

The other change is an RDF 1.1 thing. Simple literals go away and there are only xsd:strings, so the "=" then works on lang-tag-less strings like BGP matching does.


Examples:

URIs are always safe to transform:

{ ?s ?p ?o . FILTER ( ?o = <uri> ) }
=>
{ ?s ?p <uri> . BIND(<uri> AS ?o) }

sameTerm on numbers:

{ ?s ?p ?o . FILTER ( ?o = 123 ) }

is not safe to transform (?o = 00123 is a match) but

{ ?s ?p ?o . FILTER ( sameTerm(?o,123) ) }

is safe.

sameTerm(?o,"abc"@en) is optimized

(?o = "abc"@en) isn't optimized - you can call that a bug-of-omission.
I don't see why it can't do it - it seems to treat it like

(?o = "abc") which has two pattern matches. Hmm - thinking about it, it could treat that as a disjunction of sameTerm. Doable now ... a bit of a "doh" moment there.


?o = "abc"
==>
sameTerm(?o, "abc") || sameTerm(?o, "abc"^^xsd:string)

The equality optimization works with IN as well. The query is expanded for the disjunction then equality rewrite happens.
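A rough Python sketch of that two-step idea, expanding IN into one alternative per term and then grounding the alternatives that are safe (URIs). This is only an illustration working on strings; the function names are hypothetical, and how ARQ actually represents the disjunction in the algebra is not reproduced here.

```python
def is_iri(term: str) -> bool:
    # Assumption for this sketch: IRIs are written as <...>.
    return term.startswith("<") and term.endswith(">")


def rewrite_in(pattern, var, terms):
    """Expand `var IN (terms)` over one triple pattern.

    Returns one branch per term. URI branches are grounded (the variable is
    replaced in the pattern and re-bound with BIND); other terms keep a
    value-based FILTER because '=' on literals is not term equality.
    """
    branches = []
    for term in terms:
        if is_iri(term):
            grounded = tuple(term if x == var else x for x in pattern)
            branches.append("{ %s . BIND(%s AS %s) }" % (" ".join(grounded), term, var))
        else:
            branches.append("{ %s . FILTER ( %s = %s ) }" % (" ".join(pattern), var, term))
    return branches


for branch in rewrite_in(("?s", "?p", "?o"), "?o", ["<a>", "123"]):
    print(branch)
```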

That's sameTerm and = covered - it's mainly a high level (algebra to algebra) transformation.


datatype(?x) = xsd:integer

There is no optimization for this. SDB stores the data in a form where it could be done, but it doesn't. TDB does not store the datatype separately - it could, but it would be a different table layout.


Is it useful?

STRSTARTS(STR(?x), "coucou")

No specific optimization for this. There isn't a prefix index. Again, something that can be added, it just hasn't been. You can use LARQ for similar effects. It would be nice though to have an integrated prefix index, and even some regex acceleration (c.f. SQL's LIKE).
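For what a prefix index could look like in principle, here is a small Python sketch over a sorted list. Nothing like this exists in TDB as described above; build_prefix_index and starts_with are hypothetical names, and a real index would live on disk rather than in a list.

```python
import bisect


def build_prefix_index(strings):
    """Sort the distinct strings so all strings sharing a prefix are adjacent."""
    return sorted(set(strings))


def starts_with(index, prefix):
    """Return every indexed string that starts with `prefix`.

    A binary search finds the first candidate; the scan stops at the first
    string without the prefix, so only the matching range is touched.
    """
    start = bisect.bisect_left(index, prefix)
    matches = []
    for s in index[start:]:
        if not s.startswith(prefix):
            break
        matches.append(s)
    return matches


index = build_prefix_index(["coucou", "coucou2", "couch", "hello", "cou"])
print(starts_with(index, "coucou"))
```

Without such a structure, STRSTARTS has to be evaluated against every candidate binding, which is the situation described above.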

Many optimizations can be done that aren't. There is a slight issue that too much optimization means that simple queries slow down as more time is spent optimizing than just simply doing the query. BSBM shows this.

Also, it is useful to note that in TDB, certain datatypes are stored "inline" -- that is, the value is stored in the index itself, using 56 bits of the 64 bit NodeId for the encoded value. That means converting an object in a triple to its value for testing in ARQ is quite cheap, e.g. testing for being in a range is quite cheap (no node table access). Parts of the BSBM benchmark show this up in quite extreme ways.
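The inlining idea can be sketched in Python as below. Only the 56-of-64-bits split comes from the description above; the tag value, the use of the top 8 bits as a type tag, and the two's complement encoding are assumptions made for this sketch, not TDB's actual bit layout.

```python
VALUE_BITS = 56
VALUE_MASK = (1 << VALUE_BITS) - 1
INLINE_INT_TAG = 0x01  # hypothetical type tag, not TDB's real value
MIN_INLINE = -(1 << (VALUE_BITS - 1))
MAX_INLINE = (1 << (VALUE_BITS - 1)) - 1


def encode_inline_int(value):
    """Pack an integer into a 64-bit id: 8-bit tag plus 56-bit value.

    Returns None when the value does not fit in 56 bits; such a value would
    have to go through the node table instead.
    """
    if not MIN_INLINE <= value <= MAX_INLINE:
        return None
    return (INLINE_INT_TAG << VALUE_BITS) | (value & VALUE_MASK)


def decode_inline_int(node_id):
    """Recover the integer from an inline id, or None if it is not one."""
    if node_id >> VALUE_BITS != INLINE_INT_TAG:
        return None
    raw = node_id & VALUE_MASK
    if raw & (1 << (VALUE_BITS - 1)):  # sign bit set: negative number
        raw -= 1 << VALUE_BITS
    return raw


def in_range(node_id, low, high):
    """Range test using only the id itself, with no node table lookup."""
    value = decode_inline_int(node_id)
    return value is not None and low <= value <= high


node_id = encode_inline_int(50)
print(in_range(node_id, 10, 100))
```

The point of the sketch is that in_range never consults a lookup table: everything needed for the comparison is in the id that the index already returned.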

Filter placement also happens: with a long BGP, the filter can be placed just after the point where all the necessary variables are defined, before further expansion of possibilities in later parts of the pattern.
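A minimal Python sketch of that placement rule, assuming triple patterns are tuples of strings and variables start with "?". The function names are hypothetical and this is not the TDB implementation, just the rule stated above.

```python
def pattern_variables(pattern):
    """Variables mentioned in one triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def place_filter(bgp, filter_vars):
    """Index of the triple pattern after which the filter can run.

    Walks the BGP in order and returns the first position at which every
    variable used by the filter has been defined. Returns None if the BGP
    never defines all of them.
    """
    bound = set()
    for i, pattern in enumerate(bgp):
        bound |= pattern_variables(pattern)
        if filter_vars <= bound:
            return i
    return None


bgp = [
    ("?s", "<p1>", "?x"),
    ("?s", "<p2>", "?y"),
    ("?y", "<p3>", "?z"),
]
# A filter on ?x alone can run right after the first pattern,
# before the later patterns multiply the number of solutions.
print(place_filter(bgp, {"?x"}))
```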


Hope that helps - is it what you are looking for?

        Andy
