[ 
https://issues.apache.org/jira/browse/JENA-144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132560#comment-13132560
 ] 

Paolo Castagna commented on JENA-144:
-------------------------------------

> This ought to be done properly

Absolutely. 

Describing in more detail what needs to be done, and improving the description 
above, would help me or others who are willing to help implement this. I am 
still at the investigating/learning/feasibility stage. Any guidance or 
pointers in the right direction are welcome. :-) 

> relates to GeoSPARQL

Yes, another similar use case is bounding box range queries to find things 
located in a rectangular area. So, instead of a date we have latitude and 
longitude, but the same sort of optimisation applies.
Indeed, all sorts of queries FILTERing over a range of values, so long as those 
values are inlined in the TDB indexes, should benefit from this.

I have a doubt: if a value within a range cannot be encoded inline in the index 
(it would be in the node table), would this optimisation work in that case? I 
don't see how we could make sure we deliver the right answer in those cases...
                
> An optimisation for queries with FILTER ((?date > "..."^^xsd:dateTime) && 
> (?date < "..."^^xsd:dateTime)) 
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: JENA-144
>                 URL: https://issues.apache.org/jira/browse/JENA-144
>             Project: Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>              Labels: optimization, performance
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> When TDB indexes literal values, if possible, it encodes the literal value 
> directly into the NodeId. 
> See NodeId.inline(Node node) method:
> http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/trunk/src/main/java/com/hp/hpl/jena/tdb/store/NodeId.java
> At query time, since there isn't an entry in the node table for values 
> encoded in this way, there is no need to perform lookups on the node table.
> Let's consider this query pattern:
>     ?s <http://purl.org/dc/elements/1.1/date> ?date .
>     FILTER ( ( ?date > "2011-06-06T00:00:00Z"^^xsd:dateTime ) &&
>              ( ?date < "2011-06-07T00:00:00Z"^^xsd:dateTime ) )
> In this case the POS index will be used, doing a partial scan with a fixed P: 
> [(P,0,0), (P+1,0,0)) where P is the NodeId corresponding to the property used 
> in the BGP (i.e. <http://purl.org/dc/elements/1.1/date> in the example above).
> However, if there are many subjects with a date, the filter expression needs 
> to be evaluated for all the date values. Even if those date values came 
> straight out of the POS index and not from the node table, this can take a 
> while.
> We could have a better range index scan which starts at a particular value 
> (i.e. "2011-06-06T00:00:00Z"^^xsd:dateTime, from the example above). The 
> range index scan could be: [(P,D1,0), (P,D2,0)) where D1 and D2 are the 
> NodeId corresponding to the values specified in the FILTER expression.
> It is also not clear how the optimizer could decide if this will be more 
> selective than other triple patterns.
> See a couple of threads on jena-dev and jena-users mailing lists related to 
> this:
>  - http://markmail.org/thread/czopj5de3w62aacn
>  - http://markmail.org/thread/pfwl6ukbpqfw23r6
> (Or, maybe, this sort of optimisation is too specific, overly complicated... 
> and a caching layer would solve this and many other performance related 
> issues! ;-))
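
To make the two scans in the description concrete, here is a minimal sketch in plain Java. It is not TDB code: the POS index is modelled as a sorted set of (P, O, S) triples of longs, and every name in it (RangeScanSketch, scanThenFilter, rangeScan, sampleIndex) is hypothetical. It assumes NodeIds are non-negative and that inlined values sort in the same order as the dates they encode, which the description does not confirm. Because the FILTER uses strict > and <, the sketch starts the range at D1+1 rather than D1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class RangeScanSketch {
    // Hypothetical stand-in for a POS index: entries are {P, O, S} NodeIds,
    // ordered by P, then O, then S. Real TDB indexes are not TreeSets.
    static final Comparator<long[]> POS_ORDER =
        Comparator.<long[]>comparingLong(t -> t[0])
                  .thenComparingLong(t -> t[1])
                  .thenComparingLong(t -> t[2]);

    // Partial scan with a fixed P, [(P,0,0), (P+1,0,0)), evaluating the
    // filter on every entry that has this property.
    static List<Long> scanThenFilter(NavigableSet<long[]> index, long p, long d1, long d2) {
        List<Long> subjects = new ArrayList<>();
        long[] from = {p, 0, 0};
        long[] to = {p + 1, 0, 0};
        for (long[] e : index.subSet(from, true, to, false)) {
            if (e[1] > d1 && e[1] < d2) {
                subjects.add(e[2]);
            }
        }
        return subjects;
    }

    // Range scan bounded by the filter values. The lower bound is (P,D1+1,0)
    // because the filter is ?date > D1, so entries equal to D1 must be skipped.
    static List<Long> rangeScan(NavigableSet<long[]> index, long p, long d1, long d2) {
        List<Long> subjects = new ArrayList<>();
        if (d1 >= d2) {
            return subjects; // empty range, nothing can match
        }
        long[] from = {p, d1 + 1, 0};
        long[] to = {p, d2, 0};
        for (long[] e : index.subSet(from, true, to, false)) {
            subjects.add(e[2]); // no per-entry filter evaluation needed
        }
        return subjects;
    }

    // Made-up data: property 7 plays the role of dc:date.
    static NavigableSet<long[]> sampleIndex() {
        NavigableSet<long[]> index = new TreeSet<>(POS_ORDER);
        index.add(new long[] {6, 200, 9});
        index.add(new long[] {7, 100, 1});
        index.add(new long[] {7, 150, 2});
        index.add(new long[] {7, 200, 3});
        index.add(new long[] {7, 250, 4});
        index.add(new long[] {7, 300, 5});
        index.add(new long[] {8, 120, 6});
        return index;
    }

    public static void main(String[] args) {
        NavigableSet<long[]> index = sampleIndex();
        System.out.println(scanThenFilter(index, 7, 100, 300));
        System.out.println(rangeScan(index, 7, 100, 300));
    }
}
```

Both methods return the same subjects; the difference is that the range scan only touches entries between the two bounds, while the first one visits every entry with the given P. The sketch says nothing about values that cannot be inlined, nor about how an optimizer would estimate selectivity.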

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
