[ 
https://issues.apache.org/jira/browse/JENA-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864137#comment-13864137
 ] 

Rob Vesse commented on JENA-615:
--------------------------------

So I have four queries which I'm testing:

# Single value != {{FILTER}}
# Single value {{MINUS}}
# Multiple value != {{FILTER}}
# Multiple value {{MINUS}}

The queries are tailored to use values specific to the datasets.

For LUBM I only tested LUBM 0 and got the following results:

# Single value != {{FILTER}} - 0.2566s
# Single value {{MINUS}} - 0.2568s
# Multiple value != {{FILTER}} - 0.2149s
# Multiple value {{MINUS}} - 0.2081s

For SP2B I tested 10k, 50k and 250k.

10k:

# Single value != {{FILTER}} - 0.0321s
# Single value {{MINUS}} - 0.0313s
# Multiple value != {{FILTER}} - 0.0845s
# Multiple value {{MINUS}} - 0.0814s

50k:

# Single value != {{FILTER}} - 0.1012s
# Single value {{MINUS}} - 0.1000s
# Multiple value != {{FILTER}} - 0.2955s
# Multiple value {{MINUS}} - 0.2862s

250k:

# Single value != {{FILTER}} - 0.4416s
# Single value {{MINUS}} - 0.4104s
# Multiple value != {{FILTER}} - 0.9534s
# Multiple value {{MINUS}} - 0.9064s

So, as mentioned before, the difference is fairly minimal. The gain from 
{{MINUS}} is larger when there are multiple {{!=}} clauses, and the gap widens 
as the dataset size increases.  Obviously these are all fairly trivially sized 
datasets, so I'm going to grab some figures for a 1M triple dataset, which is 
still small but is about as much as I can reasonably run on my local laptop 
(mostly due to disk space constraints rather than RAM; SSDs are awesome but tiny!).
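
To put the figures above in relative terms, here is a minimal Python sketch that computes the percentage by which the {{MINUS}} form is faster than the {{!=}} {{FILTER}} form. The timings are copied from the lists above; the names {{timings}} and {{relative_gain}} are made up for illustration and are not part of any benchmark harness.

```python
# Timings in seconds, copied from the figures in this comment.
# Each entry maps (dataset, query kind) -> (FILTER time, MINUS time).
timings = {
    ("LUBM 0", "single"): (0.2566, 0.2568),
    ("LUBM 0", "multiple"): (0.2149, 0.2081),
    ("SP2B 10k", "single"): (0.0321, 0.0313),
    ("SP2B 10k", "multiple"): (0.0845, 0.0814),
    ("SP2B 50k", "single"): (0.1012, 0.1000),
    ("SP2B 50k", "multiple"): (0.2955, 0.2862),
    ("SP2B 250k", "single"): (0.4416, 0.4104),
    ("SP2B 250k", "multiple"): (0.9534, 0.9064),
}


def relative_gain(filter_s, minus_s):
    """Percentage by which MINUS is faster than FILTER (negative = slower)."""
    return (filter_s - minus_s) / filter_s * 100.0


if __name__ == "__main__":
    for (dataset, kind), (filter_s, minus_s) in timings.items():
        gain = relative_gain(filter_s, minus_s)
        print(f"{dataset:10} {kind:9} {gain:+.2f}%")
```

Note that the only case where {{MINUS}} comes out marginally slower is the single value query on LUBM 0, and the difference there is 0.0002s.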

> Possible optimisation for FILTER(?var != <constant>)
> ----------------------------------------------------
>
>                 Key: JENA-615
>                 URL: https://issues.apache.org/jira/browse/JENA-615
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Rob Vesse
>            Assignee: Rob Vesse
>            Priority: Minor
>              Labels: algebra, optimization, sparql
>
> I have an idea for a possible optimisation for queries of the following 
> general form:
> {noformat}
> SELECT *
> WHERE
> {
>   # Some Patterns
>   FILTER(?var != <http://constant>)
> } 
> {noformat}
> This pattern crops up surprisingly often in real SPARQL workloads, since it is 
> often used either to limit a variable to exclude certain possibilities or to 
> avoid self-referential links in the data.
> In some cases it seems like this could be safely rewritten as follows:
> {noformat}
> SELECT *
> WHERE
> {
>   # Some Patterns
>   MINUS { BIND(<http://constant> AS ?var) }
> }
> {noformat}
> Or perhaps in a more generalised form like so:
> {noformat}
> SELECT * WHERE
> {
>   # Some patterns
>   MINUS { VALUES ?var { <http://constant/1> <http://constant/2> } }
> }
> {noformat}
> Which would nicely deal with the case of stating that a variable is not equal 
> to multiple constant values.
> As I pointed out earlier, this would not apply in every case. Specifically, I 
> think at least the following must be true:
> - The variable must be guaranteed to be bound (similar to existing filter 
> equality and implicit join optimisations)
> There is also the potential to spot cases where the variable will always be 
> unbound, and thus the expression is always an error, and to replace the entire 
> sub-tree with {{table empty}}, as we already do for equality and implicit join 
> filters.
> I plan on taking a look at implementing this in the new year. If anyone has 
> any thoughts on this (especially wrt restrictions that should apply to 
> when the optimisation is considered safe) then please comment.
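
The requirement that the variable be guaranteed bound can be illustrated with a toy model of SPARQL solution mappings. The following is a minimal Python sketch, not Jena/ARQ code: solutions are plain dicts from variable name to IRI string, and the function names ({{filter_not_equal}}, {{minus}}, {{minus_values}}) are hypothetical. It assumes the constants are IRIs, so {{!=}} never raises a type error on bound values.

```python
# Toy model (assumption): a solution mapping is a dict {variable: IRI string}.


def filter_not_equal(solutions, var, constant):
    """FILTER(?var != <constant>).

    If ?var is unbound the expression is an error, and a solution whose
    filter evaluates to an error is dropped.
    """
    return [mu for mu in solutions if var in mu and mu[var] != constant]


def compatible(mu1, mu2):
    """Two mappings are compatible if they agree on every shared variable."""
    return all(mu1[v] == mu2[v] for v in mu1.keys() & mu2.keys())


def minus(solutions, rhs):
    """MINUS: drop mu only if some right-hand mapping shares at least one
    variable with it AND is compatible with it."""
    return [
        mu
        for mu in solutions
        if not any(mu.keys() & mu2.keys() and compatible(mu, mu2) for mu2 in rhs)
    ]


def minus_values(solutions, var, constants):
    """MINUS { VALUES ?var { c1 c2 ... } }."""
    return minus(solutions, [{var: c} for c in constants])


if __name__ == "__main__":
    c = "http://constant"
    bound = [
        {"s": "http://a", "var": c},
        {"s": "http://b", "var": "http://other"},
    ]
    # ?var always bound: both forms keep the same solutions.
    print(filter_not_equal(bound, "var", c) == minus_values(bound, "var", [c]))

    # ?var unbound in one solution: FILTER drops it, MINUS keeps it.
    mixed = bound + [{"s": "http://c"}]
    print(filter_not_equal(mixed, "var", c))
    print(minus_values(mixed, "var", [c]))
```

When {{?var}} is unbound in a solution, the {{MINUS}} right-hand side shares no variable with it, so the solution survives, whereas the {{FILTER}} removes it. That is why the rewrite is only safe when the variable is guaranteed to be bound, and why an always-unbound variable lets the {{FILTER}} form collapse to an empty result.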



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
