Re: Term numbering and range filtering

Michael McCandless Sun, 09 Nov 2008 12:24:34 -0800

Conceivably, TermInfosReader could track the sequence number of each term.

A seek/skipTo would know which sequence number it just jumped to, because the index is regular (every 128 terms by default), and then each next() call could increment that. Then retrieving this number would be as costly as calling e.g. IndexReader.docFreq(Term) is now.
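
A minimal sketch of that idea in plain Java, not Lucene's actual TermInfosReader code. The class OrdinalTermEnum and its methods are hypothetical names; a sorted String[] stands in for the .tis term dictionary and a list holding every Nth term stands in for the .tii index. Because the index is regular, seek() knows the ordinal of the index entry it lands on, and every step forward adds one:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a term enumerator that also tracks term ordinals. */
final class OrdinalTermEnum {
    private final String[] terms;      // all terms in sorted order (stands in for .tis)
    private final int indexInterval;   // every Nth term is indexed (the thread mentions 128)
    private final List<String> index = new ArrayList<>(); // indexed terms (stands in for .tii)
    private int ord = -1;              // ordinal of the current term; -1 before the first term

    OrdinalTermEnum(String[] sortedTerms, int indexInterval) {
        this.terms = sortedTerms;
        this.indexInterval = indexInterval;
        for (int i = 0; i < sortedTerms.length; i += indexInterval) {
            index.add(sortedTerms[i]);
        }
    }

    /** Positions on the first term >= target; returns false if there is none. */
    boolean seek(String target) {
        // Binary search the index for the last indexed term <= target.
        int lo = 0, hi = index.size() - 1, slot = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).compareTo(target) <= 0) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // The index is regular, so the ordinal of that index entry is known.
        ord = slot * indexInterval;
        // Scan forward, counting one ordinal per term skipped.
        while (ord < terms.length && terms[ord].compareTo(target) < 0) {
            ord++;
        }
        return ord < terms.length;
    }

    /** Advances to the next term; returns false when the terms are exhausted. */
    boolean next() {
        if (ord + 1 >= terms.length) {
            ord = terms.length;
            return false;
        }
        ord++;
        return true;
    }

    /** Ordinal (sequence number) of the current term. */
    int ord() {
        return ord;
    }

    /** Current term; only valid after seek() or next() returned true. */
    String term() {
        return terms[ord];
    }
}
```

Reading ord() after a seek() or next() is then a field access, so the cost is dominated by the seek itself.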

But I'm not sure how a multi-segment index would work, i.e. how would MultiSegmentReader compute this for its terms? Or maybe you'd just do this per-segment?


Mike

Tim Sturge wrote:

Hi,
I’m wondering if there is any easy technique to number the terms in an index. (By number I mean map a sequence of terms to a contiguous range of integers and map terms to these numbers efficiently.)
Looking at the Term class and the .tis/.tii index format it appears that the terms are stored in an ordered and prefix-compressed format, but while there are pointers from a term to the .frq and .prx files, neither is really suitable as a sequence number.
The reason I have this question is that I am writing a multi-filter for single-term fields. My index contains many fields for which each document contains a single term (e.g. date, zipcode, country) and I need to perform range queries or set matches over these fields, many of which are very inclusive (they match >10% of the total documents).
A cached RangeFilter works well when there are a small number of potential options (e.g. for countries), but when there are many options (consider a date range or a set of zipcodes) there are too many potential choices to cache each possibility, and it is too inefficient to build a filter on the fly for each query (as you have to visit 10% of documents to build the filter despite the query itself matching 0.1%).
Therefore I was considering building an int[reader.maxDoc()] array for each field and putting into it the term number for each document. This relies on the fact that each document contains only a single term for this field, but with it I should be able to quickly construct a “multi-filter” (that is, something that iterates the array and checks that the term is in the range or set).
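
A rough sketch of such a multi-filter in plain Java, assuming the per-document array of term numbers has already been built. The class OrdinalFilters, its method names, and the convention that -1 marks a document with no term for the field are all assumptions made for illustration, not Lucene API:

```java
import java.util.BitSet;

/** Hypothetical helpers that turn a doc -> term-number array into filter bits. */
final class OrdinalFilters {
    private OrdinalFilters() {}

    /** Documents whose term number lies in [lowOrd, highOrd], both inclusive. */
    static BitSet rangeFilter(int[] docToOrd, int lowOrd, int highOrd) {
        BitSet bits = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            int ord = docToOrd[doc];
            if (ord >= lowOrd && ord <= highOrd) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Documents whose term number is a member of allowedOrds. */
    static BitSet setFilter(int[] docToOrd, BitSet allowedOrds) {
        BitSet bits = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            int ord = docToOrd[doc];
            // Assumption: -1 marks a document with no term for this field.
            if (ord >= 0 && allowedOrds.get(ord)) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

Each call is a single pass over an int array with no term or postings lookups, and one array per field serves every range or set asked of that field.
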
Right now it looks like I can do some very ugly surgery and perhaps use the offset to the prx file even though it is not contiguous. But I’m hoping there is a better technique that I’m just not seeing right now.

Thanks,

Tim


