Yes, this is exactly what I was looking for. Thanks a lot Andy.

On Sat, Dec 8, 2012 at 6:50 PM, Andy Seaborne <[email protected]> wrote:

> On 07/12/12 14:28, Laurent Pellegrino wrote:
>
>> Hello all,
>>
>> I wonder whether there exists a page that briefly summarizes how the
>> SPARQL Operations and Functions are handled by TDB. The idea is to know
>> which functions or operations use B+-tree indexes (or more generally a
>> specific data structure, a property, etc.) to be resolved "efficiently",
>> and which should be used with care on a big dataset without previous BGP
>> filtering (i.e. there is "no other way" to improve them, or this is not
>> yet implemented, etc.).
>>
>> If there is no documentation about that, does someone know how the
>> following operators are handled internally by Jena TDB, supposing a BGP
>> that filters nothing and a FILTER with one of the following functions or
>> operators (it should be the worst case):
>>
>> - datatype(?x) = xsd:integer, is there a kind of index for each datatype
>> associated with a quad/triple such that when this condition appears it
>> can be checked "efficiently" without comparing the values or NodeIds
>> against all the values returned, for example, after a simple BGP? Is a
>> datatype URI stored inside the NodeTable or another table?
>>
>> - STRSTARTS(STR(?x), "coucou")
>>
>> - The simple "=" operator
>>
>> - sameTerm
>>
>> Kind Regards,
>>
>> Laurent
>>
>
> There isn't a strong connection between the functions and operators and
> the TDB index design, but the high-level optimizer does perform some
> optimizations that work well with the indexes by making patterns as
> grounded as possible.
>
> Equality rewrite is done in ARQ; filter placement, for TDB, is done in TDB
> (it can also happen in ARQ - there's a tension between whether to optimize
> the BGP then place filters, or place filters then optimize).
>
> The equality filter isn't very aggressive on literals because BGP matching
> is by term, whereas FILTER is by value.  +0123 and 123 are the same value
> but different literals.  I'm not convinced this is a good idea and maybe it
> ought to change - data loading would canonicalize literals.
>
> The other change is a RDF 1.1 thing.  Simple literals go away and there is
> only xsd:strings so the "=" then works on lang-tag-less strings like BGP
> matching does.
>
> Examples:
>
> URIs are always safe to transform:
>
> { ?s ?p ?o . FILTER ( ?o = <uri> ) }
> =>
> { ?s ?p <uri> . BIND(<uri> AS ?o) }
>
> sameTerm on numbers:
>
> { ?s ?p ?o . FILTER ( ?o = 123 ) }
>
> is not safe to transform (?o = 00123 is a match) but
>
> { ?s ?p ?o . FILTER ( sameTerm(?o,123) ) }
>
> is safe.
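
The difference between term matching and value matching described above can be sketched in a few lines of Python. This is only an illustration, not how ARQ implements it: the `Literal` tuple and the helpers `same_term` and `value_equal` are hypothetical names, and `value_equal` only handles xsd:integer, falling back to term comparison for everything else.

```python
from typing import NamedTuple, Optional

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal term."""
    lexical: str
    datatype: str
    lang: Optional[str] = None


def same_term(a: Literal, b: Literal) -> bool:
    # Term equality: lexical form, datatype and language tag must all match.
    return a == b


def value_equal(a: Literal, b: Literal) -> bool:
    # Value equality, simplified: compare xsd:integer literals by number.
    if a.datatype == XSD_INTEGER and b.datatype == XSD_INTEGER:
        return int(a.lexical) == int(b.lexical)
    # Assumption: anything else is compared as a term in this sketch.
    return same_term(a, b)


plain = Literal("123", XSD_INTEGER)
padded = Literal("+0123", XSD_INTEGER)

print(same_term(plain, padded))    # False: different terms
print(value_equal(plain, padded))  # True: same value
```

Because `padded` is equal by value but not by term, replacing ?o with the constant 123 in the pattern would lose that match, which is why only the sameTerm form is safe to rewrite.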
>
> sameTerm(?o,"abc"@en) is optimized
>
> (?o = "abc"@en) isn't optimized - you can call that a bug-of-omission.
> I don't see why it can't do it - it seems to treat it like
>
> (?o = "abc") which has two pattern matches.  Hmm - thinking about it, it
> could treat that as a disjunction of sameTerm.  Doable now ... a bit of a
> "doh" moment there.
>
> ?o = "abc"
> ==>
> sameTerm(?o, "abc") || sameTerm(?o, "abc"^^xsd:string)
>
> The equality optimization works with IN as well.  The query is expanded
> for the disjunction then equality rewrite happens.
>
> That's sameTerm and = covered - it's mainly a high level (algebra to
> algebra) transformation.
>
> datatype(?x) = xsd:integer
>
> There is no optimization for this.  SDB stores the data in a form where it
> could be done, but it doesn't.  TDB does not store the datatype separately
> - it could, but it would be a different table layout.
>
> Is it useful?
>
> STRSTARTS(STR(?x), "coucou")
>
> No specific optimization for this.  There isn't a prefix index.  Again,
> something that can be added, it just hasn't been.  You can use LARQ for
> similar effects.  It would be nice though to have an integrated prefix
> index, and even some regex acceleration (cf. SQL's LIKE).
>
> Many optimizations can be done that aren't.  There is a slight issue that
> too much optimization means that simple queries slow down as more time is
> spent optimizing than just simply doing the query.  BSBM shows this.
>
> Also, it is useful to note that in TDB, certain datatypes are stored
> "inline" -- that is, the value is stored in the index itself, using 56 bits
> of the 64 bit NodeId for the encoded value. That means converting an object
> in a triple to its value for testing in ARQ is quite cheap.  E.g.
> testing being in a range is quite cheap (no node table access).  Parts of
> the BSBM benchmark show this up in quite extreme ways.
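
A rough sketch of the inline idea, assuming the remaining 8 bits of the 64 bit id act as a type tag. The tag value, the two's-complement layout and the function names below are assumptions made for illustration; TDB's real encoding may differ in all of them.

```python
VALUE_BITS = 56
VALUE_MASK = (1 << VALUE_BITS) - 1
INLINE_INTEGER_TAG = 0x01  # hypothetical tag value, not TDB's actual one


def encode_inline_integer(value):
    """Pack an integer into a 64 bit id, or return None if it does not fit."""
    lowest = -(1 << (VALUE_BITS - 1))
    highest = (1 << (VALUE_BITS - 1)) - 1
    if not lowest <= value <= highest:
        return None  # too large to inline; would need a node table entry
    return (INLINE_INTEGER_TAG << VALUE_BITS) | (value & VALUE_MASK)


def decode_inline_integer(node_id):
    """Recover the integer from the id alone, with no table lookup."""
    if node_id >> VALUE_BITS != INLINE_INTEGER_TAG:
        raise ValueError("not an inline integer id")
    raw = node_id & VALUE_MASK
    if raw & (1 << (VALUE_BITS - 1)):  # sign bit set: negative value
        raw -= 1 << VALUE_BITS
    return raw


def in_range(node_id, low, high):
    # A range test needs only the id found in the index.
    return low <= decode_inline_integer(node_id) <= high


price_id = encode_inline_integer(42)
print(in_range(price_id, 10, 100))  # True
```

The point of the sketch is that `in_range` works from the id alone, which is why such a filter avoids any node table access.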
>
> Filter placement also happens: with a long BGP, the filter can be placed
> just after the point where all the necessary variables are defined, before
> further expansion of possibilities in later parts of the pattern.
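
The placement rule can be sketched as follows. It is a simplified illustration that assumes the BGP is already in execution order and that variables are written with a leading "?"; `filter_position` is a hypothetical helper, not a Jena API.

```python
def filter_position(patterns, filter_vars):
    """Return how many triple patterns must run before the filter can be
    applied, or None if the BGP never binds all of the filter's variables."""
    bound = set()
    for index, triple in enumerate(patterns):
        # Every variable mentioned in this pattern is bound after it runs.
        bound.update(term for term in triple if term.startswith("?"))
        if filter_vars <= bound:
            return index + 1
    return None


bgp = [
    ("?s", ":type", ":Product"),
    ("?s", ":price", "?price"),
    ("?s", ":label", "?label"),
]

# A filter on ?price can run after the second pattern, before ?label expands.
print(filter_position(bgp, {"?price"}))  # 2
```

Placing the filter at that position discards non-matching rows before the third pattern multiplies the intermediate results.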
>
> Hope that helps - is it what you are looking for?
>
>         Andy
>
