Tim, I didn't follow all the details, so this may be somewhat off, but did you consider using TermVectors?
Regards, Paul Elschot Op Monday 10 November 2008 19:18:38 schreef Tim Sturge: > Yes, that is a significant issue. What I'm coming to realize is that > either I will end up with something like > > class MultiFilter { > String field; > private int[] termInDoc; > Map<Term,int> termToInt; > ... > } > > which can be entirely built on the current lucene APIs but has > significantly more overhead (the termToInt mapping in particular and > the need to construct the mapping and array on startup) > > > Or I can go deep into the guts and add a data file per-segment with a > format something like > > int version > int numFields > (int fieldNum, long offset) ^ numFields > (int termForDoc) ^ (maxDocs * numFields) > > and add something to FieldInfo like boolean storeMultiFilter; > and FieldInfos something like STORE_MULTIFILTER = 0x40; I'd need to > add an int termNum to the .tis file as well. > > This is clearly a lot more work than the first solution, but it is a > lot nicer to deal with as well. Is this interesting to anyone other > than me? > > Tim > > On 11/9/08 12:23 PM, "Michael McCandless" <[EMAIL PROTECTED]> wrote: > > Conceivably, TermInfosReader could track the sequence number of > > each term. > > > > A seek/skipTo would know which sequence number it just jumped too, > > because the index is regular (every 128 terms by default), and then > > each next() call could increment that. Then retrieving this number > > would be as costly as calling eg IndexReader.docFreq(Term) is now. > > > > But I'm not sure how a multi-segment index would work, ie how > > would MultiSegmentReader compute this for its terms? Or maybe > > you'd just do this per-segment? > > > > Mike > > > > Tim Sturge wrote: > >> Hi, > >> > >> I¹m wondering if there is any easy technique to number the terms > >> in an index > >> (By number I mean map a sequence of terms to a contiguous range of > >> integers > >> and map terms to these numbers efficiently) > >> > >> Looking at the Term class and the .tis/.tii index format it > >> appears that the > >> terms are stored in an ordered and prefix-compressed format, but > >> while there > >> are pointers from a term to the .frq and .prx files, neither is > >> really suitable as a sequence number. > >> > >> The reason I have this question is that I am writing a > >> multi-filter for > >> single term fields. My index contains many fields for which each > >> document > >> contains a single term (e.g. date, zipcode, country) and I need to > >> perform > >> range queries or set matches over these fields, many of which are > >> very inclusive (they match >10% of the total documents) > >> > >> A cached RangeFilter works well when there are a small number of > >> potential > >> options (e.g. for countries) but when there are many options > >> (consider a > >> date range or a set of zipcodes) there are too many potential > >> choices to > >> cache each possibility and it is too inefficient to build a filter > >> on the > >> fly for each query (as you have to visit 10% of documents to build > >> the filter despite the query itself matching 0.1%) > >> > >> Therefore I was considering building a int[reader.maxDocs()] array > >> for each > >> field and putting into it the term number for each document. This > >> relies on > >> the fact that each document contains only a single term for this > >> field, but > >> with it I should be able to quickly construct a ³multi-filter² > >> (that is, > >> something that iterates the array and checks that the term is in > >> the range > >> or set). > >> > >> Right now it looks like I can do some very ugly surgery and > >> perhaps use the > >> offset to the prx file even though it is not contiguous. But I¹m > >> hoping > >> there is a better technique that I¹m just not seeing right now. > >> > >> Thanks, > >> > >> Tim > > > > ------------------------------------------------------------------- > >-- To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]