Hmmm -- I hadn't thought about that, so I took a quick look at the term vector support.
What I'm really looking for is a compact but performant representation of a
set of filters over the same single-term field. Using term vectors would
mean an algorithm similar to:

    String myfield;
    String myterm;
    TermFreqVector tv;
    for (int i = 0; i < reader.maxDoc(); i++) {
        tv = reader.getTermFreqVector(i, myfield);
        if (tv != null && tv.indexOf(myterm) != -1) {
            // include this doc...
        }
    }

The key thing I am looking to achieve here is performance comparable to
filters. I suspect getTermFreqVector() is not efficient enough, but I'll
give it a try.

Tim

On 11/10/08 10:54 AM, "Paul Elschot" <[EMAIL PROTECTED]> wrote:

> Tim,
>
> I didn't follow all the details, so this may be somewhat off,
> but did you consider using TermVectors?
>
> Regards,
> Paul Elschot
>
>
> On Monday 10 November 2008 19:18:38 Tim Sturge wrote:
>> Yes, that is a significant issue. What I'm coming to realize is that
>> either I will end up with something like
>>
>>     class MultiFilter {
>>         String field;
>>         private int[] termInDoc;
>>         Map<Term, Integer> termToInt;
>>         ...
>>     }
>>
>> which can be entirely built on the current Lucene APIs but has
>> significantly more overhead (the termToInt mapping in particular, and
>> the need to construct the mapping and array on startup).
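A minimal sketch of how that first approach might be assembled from the
stock Lucene 2.x term APIs (TermEnum and TermDocs): the class shape follows
the skeleton quoted above, but the constructor and everything inside it are
illustrative assumptions, not an existing Lucene class.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    class MultiFilter {
        String field;
        private int[] termInDoc;            // doc id -> term ordinal; 0 means "no term"
        Map<Term, Integer> termToInt = new HashMap<Term, Integer>();

        MultiFilter(IndexReader reader, String field) throws IOException {
            this.field = field;
            this.termInDoc = new int[reader.maxDoc()];
            // Walk the field's terms in sorted order, assigning ordinals as we go.
            TermEnum terms = reader.terms(new Term(field, ""));
            TermDocs docs = reader.termDocs();
            try {
                int ord = 1; // reserve 0 for documents without the field
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field)) {
                        break; // ran past the last term of this field
                    }
                    termToInt.put(t, ord);
                    docs.seek(terms);
                    while (docs.next()) {
                        termInDoc[docs.doc()] = ord; // single-term field: one write per doc
                    }
                    ord++;
                } while (terms.next());
            } finally {
                terms.close();
                docs.close();
            }
        }
    }

Construction walks every posting of the field exactly once, which is the
startup overhead Tim weighs against the on-disk variant he sketches next.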
>> Or I can go deep into the guts and add a data file per segment with a
>> format something like
>>
>>     int version
>>     int numFields
>>     (int fieldNum, long offset) ^ numFields
>>     (int termForDoc) ^ (maxDocs * numFields)
>>
>> and add something like boolean storeMultiFilter; to FieldInfo, and
>> something like STORE_MULTIFILTER = 0x40; to FieldInfos. I'd need to
>> add an int termNum to the .tis file as well.
>>
>> This is clearly a lot more work than the first solution, but it is a
>> lot nicer to deal with as well. Is this interesting to anyone other
>> than me?
>>
>> Tim
>>
>> On 11/9/08 12:23 PM, "Michael McCandless" <[EMAIL PROTECTED]> wrote:
>>> Conceivably, TermInfosReader could track the sequence number of
>>> each term.
>>>
>>> A seek/skipTo would know which sequence number it just jumped to,
>>> because the index is regular (every 128 terms by default), and then
>>> each next() call could increment that. Then retrieving this number
>>> would be as costly as calling e.g. IndexReader.docFreq(Term) is now.
>>>
>>> But I'm not sure how a multi-segment index would work, i.e. how
>>> would MultiSegmentReader compute this for its terms? Or maybe you'd
>>> just do this per segment?
>>>
>>> Mike
>>>
>>> Tim Sturge wrote:
>>>> Hi,
>>>>
>>>> I'm wondering if there is any easy technique to number the terms
>>>> in an index. (By "number" I mean: map a sequence of terms to a
>>>> contiguous range of integers, and map terms to these numbers
>>>> efficiently.)
>>>>
>>>> Looking at the Term class and the .tis/.tii index format, it
>>>> appears that the terms are stored in an ordered and
>>>> prefix-compressed format, but while there are pointers from a term
>>>> to the .frq and .prx files, neither is really suitable as a
>>>> sequence number.
>>>>
>>>> The reason I have this question is that I am writing a multi-filter
>>>> for single-term fields. My index contains many fields for which
>>>> each document contains a single term (e.g. date, zipcode, country),
>>>> and I need to perform range queries or set matches over these
>>>> fields, many of which are very inclusive (they match >10% of the
>>>> total documents).
>>>>
>>>> A cached RangeFilter works well when there are a small number of
>>>> potential options (e.g. for countries), but when there are many
>>>> options (consider a date range or a set of zipcodes) there are too
>>>> many potential choices to cache each possibility, and it is too
>>>> inefficient to build a filter on the fly for each query (as you
>>>> have to visit 10% of the documents to build the filter even though
>>>> the query itself matches 0.1%).
>>>>
>>>> Therefore I was considering building an int[reader.maxDoc()] array
>>>> for each field and putting into it the term number for each
>>>> document. This relies on the fact that each document contains only
>>>> a single term for this field, but with it I should be able to
>>>> quickly construct a "multi-filter" (that is, something that
>>>> iterates the array and checks that the term is in the range or
>>>> set).
>>>>
>>>> Right now it looks like I can do some very ugly surgery and perhaps
>>>> use the offset into the .prx file even though it is not contiguous.
>>>> But I'm hoping there is a better technique that I'm just not seeing
>>>> right now.
>>>>
>>>> Thanks,
>>>>
>>>> Tim
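And a hedged sketch of the matching side described in the original message
(iterate the array and test each document's ordinal against the range),
continuing the MultiFilter sketch above. The rangeBits name and the use of
java.util.BitSet are assumptions for illustration:

    import java.util.BitSet;

    // Continues the MultiFilter sketch above: because ordinals were assigned
    // in term-sorted order, a term range becomes a contiguous ordinal range,
    // and one linear pass over termInDoc builds the filter without touching
    // the index at query time.
    BitSet rangeBits(Term lower, Term upper) {
        BitSet bits = new BitSet(termInDoc.length);
        Integer lo = termToInt.get(lower); // assumes both endpoints occur as terms
        Integer hi = termToInt.get(upper);
        if (lo == null || hi == null) {
            return bits; // a real version would snap to the nearest enclosing ordinals
        }
        for (int doc = 0; doc < termInDoc.length; doc++) {
            int ord = termInDoc[doc];
            if (ord >= lo.intValue() && ord <= hi.intValue()) {
                bits.set(doc);
            }
        }
        return bits;
    }

A set match is the same scan with a membership test (for example a BitSet of
allowed ordinals) in place of the range comparison; either way a query costs
one pass over maxDoc() ints instead of a visit to 10% of the postings.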