Yes, that is a significant issue. What I'm coming to realize is that either I
will end up with something like

    class MultiFilter {
        String field;
        private int[] termInDoc;
        Map<Term,Integer> termToInt;  // generics need the boxed type, not int
        ...
    }

which can be entirely built on the current Lucene APIs but has significantly
more overhead (the termToInt mapping in particular, and the need to construct
the mapping and array on startup).
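For concreteness, here is roughly how that first version would get built on
the existing APIs -- an untested sketch with made-up names, which reserves
term number 0 for documents that have no term in the field:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    // Walk the field's terms in sorted order, assign each the next integer,
    // and record that integer for every document containing the term.
    static int[] buildTermInDoc(IndexReader reader, String field,
                                Map<Term,Integer> termToInt) throws IOException {
      int[] termInDoc = new int[reader.maxDoc()];  // all docs start at 0 = "no term"
      TermEnum terms = reader.terms(new Term(field, ""));
      TermDocs docs = reader.termDocs();
      int termNum = 1;
      try {
        do {
          Term t = terms.term();
          if (t == null || !t.field().equals(field)) break;  // ran off the field
          termToInt.put(t, new Integer(termNum));
          docs.seek(terms);
          while (docs.next()) {
            termInDoc[docs.doc()] = termNum;  // relies on one term per doc per field
          }
          termNum++;
        } while (terms.next());
      } finally {
        docs.close();
        terms.close();
      }
      return termInDoc;
    }

Because the terms are enumerated in sorted order, the assigned integers
preserve term order, which is what makes range checks possible later.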
Or I can go deep into the guts and add a data file per segment with a format
something like

    int version
    int numFields
    (int fieldNum, long offset) ^ numFields
    (int termForDoc) ^ (maxDoc * numFields)

and add something to FieldInfo like

    boolean storeMultiFilter;

and to FieldInfos something like

    STORE_MULTIFILTER = 0x40;

I'd need to add an int termNum to the .tis file as well. This is clearly a
lot more work than the first solution, but it is a lot nicer to deal with as
well.
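To make that layout concrete, the writer side would be something like the
following -- again only a sketch; the .mfd extension and all the names are
invented, and the offsets can be computed up front because each field's
column is exactly maxDoc * 4 bytes:

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexOutput;

    static final int VERSION = 0;

    // columns[i][doc] holds the term number of doc for field fieldNums[i],
    // as produced by the per-field numbering pass above.
    static void writeMultiFilterData(Directory dir, String segment,
                                     int[] fieldNums, int[][] columns,
                                     int maxDoc) throws IOException {
      IndexOutput out = dir.createOutput(segment + ".mfd");
      try {
        out.writeInt(VERSION);
        out.writeInt(fieldNums.length);
        // header is two ints plus one (int,long) pair per field
        long offset = 8 + 12L * fieldNums.length;
        for (int i = 0; i < fieldNums.length; i++) {
          out.writeInt(fieldNums[i]);
          out.writeLong(offset);
          offset += 4L * maxDoc;  // each column is maxDoc ints
        }
        for (int i = 0; i < fieldNums.length; i++) {
          for (int doc = 0; doc < maxDoc; doc++) {
            out.writeInt(columns[i][doc]);
          }
        }
      } finally {
        out.close();
      }
    }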
Is this interesting to anyone other than me?

Tim


On 11/9/08 12:23 PM, "Michael McCandless" <[EMAIL PROTECTED]> wrote:

> Conceivably, TermInfosReader could track the sequence number of each
> term.
>
> A seek/skipTo would know which sequence number it just jumped to,
> because the index is regular (every 128 terms by default), and then
> each next() call could increment that. Then retrieving this number
> would be as costly as calling e.g. IndexReader.docFreq(Term) is now.
>
> But I'm not sure how a multi-segment index would work, i.e. how would
> MultiSegmentReader compute this for its terms? Or maybe you'd just do
> this per-segment?
>
> Mike
>
> Tim Sturge wrote:
>
>> Hi,
>>
>> I'm wondering if there is any easy technique to number the terms in
>> an index. (By "number" I mean map a sequence of terms to a contiguous
>> range of integers, and map terms to these numbers efficiently.)
>>
>> Looking at the Term class and the .tis/.tii index format, it appears
>> that the terms are stored in an ordered and prefix-compressed format,
>> but while there are pointers from a term to the .frq and .prx files,
>> neither is really suitable as a sequence number.
>>
>> The reason I have this question is that I am writing a multi-filter
>> for single-term fields. My index contains many fields for which each
>> document contains a single term (e.g. date, zipcode, country), and I
>> need to perform range queries or set matches over these fields, many
>> of which are very inclusive (they match >10% of the total documents).
>>
>> A cached RangeFilter works well when there are a small number of
>> potential options (e.g. for countries), but when there are many
>> options (consider a date range or a set of zipcodes) there are too
>> many potential choices to cache each possibility, and it is too
>> inefficient to build a filter on the fly for each query (as you have
>> to visit 10% of documents to build the filter despite the query
>> itself matching 0.1%).
>>
>> Therefore I was considering building an int[reader.maxDoc()] array
>> for each field and putting into it the term number for each document.
>> This relies on the fact that each document contains only a single
>> term for this field, but with it I should be able to quickly
>> construct a "multi-filter" (that is, something that iterates the
>> array and checks that the term is in the range or set).
>>
>> Right now it looks like I can do some very ugly surgery and perhaps
>> use the offset to the .prx file even though it is not contiguous.
>> But I'm hoping there is a better technique that I'm just not seeing
>> right now.
>>
>> Thanks,
>>
>> Tim
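P.S. For completeness, the "multi-filter" itself is trivial once the
per-field array exists. A rough cut against the 2.x Filter API (names
invented; this assumes term numbers were assigned in sorted term order, so
a term range maps to a contiguous integer range):

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    // Matches every doc whose term number for the field falls in [lo, hi].
    class MultiRangeFilter extends Filter {
      private final int[] termInDoc;
      private final int lo, hi;

      MultiRangeFilter(int[] termInDoc, int lo, int hi) {
        this.termInDoc = termInDoc;
        this.lo = lo;
        this.hi = hi;
      }

      public BitSet bits(IndexReader reader) throws IOException {
        BitSet result = new BitSet(termInDoc.length);
        for (int doc = 0; doc < termInDoc.length; doc++) {
          int t = termInDoc[doc];
          if (t >= lo && t <= hi) {
            result.set(doc);
          }
        }
        return result;
      }
    }

This is a single linear pass over one int per document no matter how wide
the range is, which is the whole point versus building a filter term by
term.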