Doug,

I believe TermDocs(t).freq()) gives me the number of documents in which the
term appears, as opposed to the total number of times the term appears. Any
particular reason for choosing that metric (it seems less accurate, but
maybe it's the only one that can be easily retrieved).

---

As for the costs mentioned below. I'd assume that they be different from
machine to machine. If currently we don't have the time for determining the
exact costs, do you have assumptions to start with (or magnitudes, for
example  k_tq >> j_tq etc.)? If not, we might go ahead with setting all the
constants to 1, as a starting point.

> The general cost of executing a TermQuery is something like:
> 
>     k_tq + j_tq * IndexReader.docFreq(t)
> 
> Where t is the term, and k_tq and j_tq are constants that need to be
> determined.  k_tq mostly represents the cost of looking the term up in
> the dictionary, and j_tq represents the cost of scoring each document.
> 
> The cost of a BooleanQuery is something like
> 
>     sum ( cost(c) +  k_bq )
>      c

> A worst-case cost function for SpanNearQuery is something like:
> 
>    sum (k_snq * cost(c) * log(|q|) )
>     c
> 
> where |q| is the number of terms in the query.  Things should be better
> than this, however, if the skipTo() optimization fires.  This is
> effective when one or more of the terms occurs less than 1/16th as many
> documents as another, and can provide up-to a 16-fold speedup.
> 
> > SpanOrQuery: 10 + total of the cost of all clauses
> 
>    sum ( k_soq * cost(c) * log(|q|) )
>     c
> 
> But here the skipTo() optimization is not effective.
> 
> > SpanTermQuery: 1 over #of characters in the term
> 
>     k_stq + j_stq * sum (TermDocs(t).freq())
>                      d


As always, thanks for your input,
Jochen






---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to