Stephen Allen created JENA-999:
----------------------------------

             Summary: Poor jena-text query performance when a bound subject is used
                 Key: JENA-999
                 URL: https://issues.apache.org/jira/browse/JENA-999
             Project: Apache Jena
          Issue Type: Improvement
            Reporter: Stephen Allen
            Assignee: Stephen Allen
            Priority: Minor


When executing a jena-text query, performance is very poor if the subject 
variable is already bound.  This is because the current code executes a new 
Lucene query on every iteration, without constraining it to the bound 
subject/entity, and then iterates through the Lucene results to join against 
the subject.  This is quite inefficient.

Example query:
{code}
select *
where {
  ?s rdf:type <http://example.org/Entity> .
  ?s text:query ( rdfs:label "test" ) .
}
{code}
This would be quite slow if there were a lot of entities in the system.

Two potential solutions present themselves:
# Craft a more explicit Lucene query that specifies the entity URI, so that the 
result set coming back from Lucene is much smaller.  However, this would cause 
problems with the score not being correct across multiple iterations.  
Additionally, we would still potentially be running a lot of Lucene queries, 
each of which probably has a non-negligible constant cost (parsing the query 
string, etc).
# Execute the more general Lucene query the first time it is encountered, then 
cache the results somewhere.  From there, we can perform a hash table lookup 
against those cached results.
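
A rough sketch of what option 2 amounts to, using only plain Java collections.  
This is not the jena-text or Lucene API: {{SubjectHitCache}}, {{Hit}}, and the 
{{Function}} standing in for the Lucene search are hypothetical names chosen 
for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: run the general text query once, then join by hash lookup. */
final class SubjectHitCache {
    /** One text-index hit: the entity URI and its score. */
    static final class Hit {
        final String subject;
        final float score;
        Hit(String subject, float score) { this.subject = subject; this.score = score; }
    }

    // Stand-in for the real Lucene search: query string in, all hits out.
    private final Function<String, List<Hit>> search;
    // query string -> (subject URI -> hit)
    private final Map<String, Map<String, Hit>> cache = new HashMap<>();
    private int searchCount = 0;

    SubjectHitCache(Function<String, List<Hit>> search) { this.search = search; }

    /** Returns the hit for an already-bound subject, or null if it did not match. */
    Hit lookup(String queryString, String boundSubject) {
        Map<String, Hit> bySubject = cache.get(queryString);
        if (bySubject == null) {
            // First time this query is seen: run the search once and index by subject.
            searchCount++;
            bySubject = new HashMap<>();
            for (Hit h : search.apply(queryString)) {
                bySubject.putIfAbsent(h.subject, h);   // keep the first hit per subject
            }
            cache.put(queryString, bySubject);
        }
        return bySubject.get(boundSubject);
    }

    /** Number of times the underlying search was actually executed. */
    int searchCount() { return searchCount; }
}
```

With this shape, each incoming binding costs one map lookup instead of one 
Lucene query, and the scores all come from the same single search.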

I would like to pursue option 2, but there is a problem.  Because jena-text is 
implemented as a property function rather than as a query op in its own right 
(as QueryIterMinus is, for example), we have to find a place to stash the 
Lucene results.  I believe this can be done by placing them in the 
ExecutionContext object, using the Lucene query as a cache key.  Updates 
present a slightly troubling case, because you could have an update request like:
{code}
insert data { <urn:test1> rdf:type <http://example.org/Entity> ; rdfs:label "test" } ;

delete { ?s ?p ?o }
where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label "test" ) ; ?p ?o . } ;

insert data { <urn:test2> rdf:type <http://example.org/Entity> ; rdfs:label "test" } ;

delete { ?s ?p ?o }
where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label "test" ) ; ?p ?o . }
{code}
The end result should be an empty database.  But if the same ExecutionContext 
were used for both delete operations, the second delete would reuse the cached 
results from the first, and {{<urn:test2>}} would not be deleted.
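
To make the failure mode concrete, here is a small simulation of that update 
sequence with a cache keyed only by the query string.  It is a toy model in 
plain Java, not Jena code: {{StaleCacheDemo}} is a hypothetical name, a list 
stands in for the dataset and text index, and the cache is assumed to be shared 
across the whole request.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical simulation: four-step update with a cache keyed only by query string. */
final class StaleCacheDemo {
    /** Runs the update sequence and returns the subjects left in the store. */
    static List<String> run() {
        List<String> store = new ArrayList<>();              // stands in for dataset + text index
        Map<String, List<String>> cache = new HashMap<>();   // assumed shared across the request
        String query = "rdfs:label test";

        store.add("urn:test1");                 // insert data (first)
        deleteMatches(store, cache, query);     // first delete: populates the cache
        store.add("urn:test2");                 // insert data (second)
        deleteMatches(store, cache, query);     // second delete: reuses the stale results
        return store;
    }

    private static void deleteMatches(List<String> store,
                                      Map<String, List<String>> cache,
                                      String query) {
        // The "search" only runs when the query string has not been seen before.
        List<String> hits = cache.computeIfAbsent(query, q -> new ArrayList<>(store));
        store.removeAll(hits);
    }
}
```

In this model {{StaleCacheDemo.run()}} leaves {{urn:test2}} in the store, 
because the second delete never sees a result set that includes it.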

If the ExecutionContext is indeed shared between the two update operations in 
the situation above, I think this can be solved by making the cache key for the 
Lucene result set a combination of both the Lucene query and the QueryIterRoot 
or BindingRoot.  I need to investigate this.
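
One possible shape for such a combined key is sketched below.  {{TextCacheKey}} 
is a hypothetical class, and the root is typed as a plain {{Object}} standing in 
for a QueryIterRoot or BindingRoot instance.  The sketch assumes each update 
operation gets its own root object, which is exactly the point that still needs 
investigating.

```java
import java.util.Objects;

/** Hypothetical cache key: Lucene query string plus the identity of the execution root. */
final class TextCacheKey {
    private final String queryString;
    private final Object root;   // stand-in for a QueryIterRoot or BindingRoot instance

    TextCacheKey(String queryString, Object root) {
        this.queryString = Objects.requireNonNull(queryString);
        this.root = Objects.requireNonNull(root);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TextCacheKey)) return false;
        TextCacheKey other = (TextCacheKey) o;
        // Roots are compared by identity, on the assumption that each
        // operation is executed with a fresh root object.
        return queryString.equals(other.queryString) && root == other.root;
    }

    @Override
    public int hashCode() {
        return 31 * queryString.hashCode() + System.identityHashCode(root);
    }
}
```

Used as the key of a map stored in the ExecutionContext, the same query string 
under two different roots would map to two separate cache entries, so a later 
operation would not pick up an earlier operation's results.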



