Hi There!

We are in the process of building a query optimizer for Lucene RangeQueries
(we need this because we run fairly complex range queries with a few hundred
terms against large corpora, and response time needs improvement). We have
written a framework that allows us to traverse queries and rearrange /
recreate subqueries.
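
To give a rough idea, a traversal / rebuild pass over span queries could look
something like the sketch below (illustrative only; SpanQueryRewriter and
transformLeaf are made-up names, not the actual framework code):

    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;

    public abstract class SpanQueryRewriter {

        // Recursively rebuild the query tree, giving subclasses a chance
        // to replace or drop subqueries.
        public SpanQuery rewrite(SpanQuery query) {
            if (query instanceof SpanNearQuery) {
                SpanNearQuery near = (SpanNearQuery) query;
                return new SpanNearQuery(rewriteAll(near.getClauses()),
                                         near.getSlop(), near.isInOrder());
            }
            if (query instanceof SpanOrQuery) {
                return new SpanOrQuery(rewriteAll(((SpanOrQuery) query).getClauses()));
            }
            // Leaves (e.g. SpanTermQuery) go through the transformation hook.
            return transformLeaf(query);
        }

        private SpanQuery[] rewriteAll(SpanQuery[] clauses) {
            SpanQuery[] result = new SpanQuery[clauses.length];
            for (int i = 0; i < clauses.length; i++) {
                result[i] = rewrite(clauses[i]);
            }
            return result;
        }

        // Hook where an optimizer can rewrite or eliminate individual terms.
        protected abstract SpanQuery transformLeaf(SpanQuery leaf);
    }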

As a next step, we tried to find criteria to optimize for. A simple one is to
reduce the total number of terms in the query.

Question 1: Is it a good idea to minimize the # of terms?


Some optimization options, however, leave open the choice of which term to
eliminate. To make that choice we are using a fairly simple cost estimator for
queries and terms (currently we only deal with SpanNearQuery, SpanOrQuery
and SpanTermQuery):

SpanNearQuery:  10 - (# of clauses) + total cost of all clauses
SpanOrQuery:    10 + total cost of all clauses
SpanTermQuery:  1 / (# of characters in the term)
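
In code, these heuristics boil down to something like the following (just a
sketch; SpanQueryCostEstimator is an illustrative name):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class SpanQueryCostEstimator {

        public double cost(SpanQuery query) {
            if (query instanceof SpanTermQuery) {
                // Shorter terms are assumed to be more common, hence costlier.
                Term term = ((SpanTermQuery) query).getTerm();
                return 1.0 / term.text().length();
            }
            if (query instanceof SpanNearQuery) {
                SpanQuery[] clauses = ((SpanNearQuery) query).getClauses();
                return 10.0 - clauses.length + sum(clauses);
            }
            if (query instanceof SpanOrQuery) {
                SpanQuery[] clauses = ((SpanOrQuery) query).getClauses();
                return 10.0 + sum(clauses);
            }
            // Unknown query types get a neutral default cost.
            return 1.0;
        }

        private double sum(SpanQuery[] clauses) {
            double total = 0.0;
            for (SpanQuery clause : clauses) {
                total += cost(clause);
            }
            return total;
        }
    }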

Question 2: Does anyone have better cost estimates or comments about this?


This optimization all happens client-side (i.e. as of this writing, the
optimizer does not know the statistics for the tokens actually stored in the
index).

Question 3: How do I get access to term frequencies (i.e. the number of
times a given Term appears in the index)? I assume that the way to go is
getTermFreqVectors in IndexReader. This should allow for better choices as
to which term to eliminate.
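
For what it's worth, here is roughly what pulling such statistics from an
IndexReader could look like (a sketch against the classic IndexReader /
TermDocs API; docFreq() counts the documents containing a term, while summing
TermDocs.freq() gives the total number of occurrences):

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class TermStats {

        // Number of documents in the index that contain the term.
        public static int docFreq(IndexReader reader, Term term) throws IOException {
            return reader.docFreq(term);
        }

        // Total number of occurrences of the term across all documents.
        public static long totalFreq(IndexReader reader, Term term) throws IOException {
            long total = 0;
            TermDocs docs = reader.termDocs(term);
            try {
                while (docs.next()) {
                    total += docs.freq();
                }
            } finally {
                docs.close();
            }
            return total;
        }
    }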

Question 4: What are good cost estimates assuming that we have term
frequencies available?


And yes, if all of this ends up working we'll make the code available to the
project.

Cheers,
        Jochen


