Conceivably, TermInfosReader could track the sequence number of each
term.
A seek/skipTo would know which sequence number it just jumped too,
because the index is regular (every 128 terms by default), and then
each next() call could increment that. Then retrieving this number
would be as costly as calling eg IndexReader.docFreq(Term) is now.
But I'm not sure how a multi-segment index would work, ie how would
MultiSegmentReader compute this for its terms? Or maybe you'd just do
this per-segment?
Mike
Tim Sturge wrote:
Hi,
I’m wondering if there is any easy technique to number the terms in
an index
(By number I mean map a sequence of terms to a contiguous range of
integers
and map terms to these numbers efficiently)
Looking at the Term class and the .tis/.tii index format it appears
that the
terms are stored in an ordered and prefix-compressed format, but
while there
are pointers from a term to the .frq and .prx files, neither is really
suitable as a sequence number.
The reason I have this question is that I am writing a multi-filter
for
single term fields. My index contains many fields for which each
document
contains a single term (e.g. date, zipcode, country) and I need to
perform
range queries or set matches over these fields, many of which are very
inclusive (they match >10% of the total documents)
A cached RangeFilter works well when there are a small number of
potential
options (e.g. for countries) but when there are many options
(consider a
date range or a set of zipcodes) there are too many potential
choices to
cache each possibility and it is too inefficient to build a filter
on the
fly for each query (as you have to visit 10% of documents to build the
filter despite the query itself matching 0.1%)
Therefore I was considering building a int[reader.maxDocs()] array
for each
field and putting into it the term number for each document. This
relies on
the fact that each document contains only a single term for this
field, but
with it I should be able to quickly construct a “multi-filter” (that
is,
something that iterates the array and checks that the term is in the
range
or set).
Right now it looks like I can do some very ugly surgery and perhaps
use the
offset to the prx file even though it is not contiguous. But I’m
hoping
there is a better technique that I’m just not seeing right now.
Thanks,
Tim
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]