I think we've gone around in a loop here. It's exactly because cached filters are inadequate for this case that I'm considering this approach.
Here's the section from my first email that is most illuminating:

"The reason I have this question is that I am writing a multi-filter for single-term fields. My index contains many fields for which each document contains a single term (e.g. date, zipcode, country), and I need to perform range queries or set matches over these fields, many of which are very inclusive (they match >10% of the total documents).

A cached RangeFilter works well when there are a small number of potential options (e.g. for countries), but when there are many options (consider a date range or a set of zipcodes) there are too many potential choices to cache each possibility, and it is too inefficient to build a filter on the fly for each query (you have to visit 10% of the documents to build the filter even though the query itself matches only 0.1%).

Therefore I was considering building an int[reader.maxDoc()] array for each field and putting into it the term number for each document. This relies on the fact that each document contains only a single term for this field, but with it I should be able to quickly construct a "multi-filter" (that is, something that iterates the array and checks that the term is in the range or set)."

Does this help explain my rationale? The reason I'm posting here is that I imagine there are lots of people with this issue. In particular, date ranges seem to be something that lots of people use but Lucene implements fairly poorly.

Tim

On 11/10/08 1:58 PM, "Paul Elschot" <[EMAIL PROTECTED]> wrote:

> On Monday 10 November 2008 22:21:20, Tim Sturge wrote:
>> Hmmm -- I hadn't thought about that, so I took a quick look at the
>> term vector support.
>>
>> What I'm really looking for is a compact but performant
>> representation of a set of filters on the same single-term field.
>> Using term vectors would mean an algorithm similar to:
>>
>>     String myfield = "country";
>>     String myterm = "US";  // example term to match
>>     TermFreqVector tv;
>>     for (int i = 0; i < reader.maxDoc(); i++) {
>>         tv = reader.getTermFreqVector(i, myfield);
>>         if (tv != null && tv.indexOf(myterm) != -1) {
>>             // include this doc...
>>         }
>>     }
>>
>> The key thing I am looking to achieve here is performance comparable
>> to filters. I suspect getTermFreqVector() is not efficient enough, but
>> I'll give it a try.
>>
>
> Better use a TermDocs on myterm for this; have a look at the code of
> RangeFilter.
>
> Filters are normally created from a slower query by setting a bit in an
> OpenBitSet at "include this doc". Then they are reused for their speed.
>
> Filter caching could help. In case memory becomes a problem
> and the filters are sparse enough, try to use SortedVIntList
> as the underlying data structure in the cache. (Sparse enough means
> fewer than 1 in 8 of all docs available in the index reader.)
> See also LUCENE-1296 for caching another data structure than the
> one used to collect the filtered docs.
>
> Regards,
> Paul Elschot
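
P.S. To make the array idea concrete, here is a rough, untested sketch of what I mean, written against the 2.x API. The class name and the range/set check are just illustrative; nothing here is code from either of our mails:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    // One int per document holding the ordinal of the single term the
    // document carries in `field` (0 is reserved for "no term").
    // Built once per IndexReader, then reused across queries.
    public class SingleTermOrdinals {
        public final int[] ordinals;

        public SingleTermOrdinals(IndexReader reader, String field) throws IOException {
            ordinals = new int[reader.maxDoc()];
            TermEnum terms = reader.terms(new Term(field, ""));
            TermDocs termDocs = reader.termDocs();
            try {
                int ord = 0;
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field)) break;
                    ord++;  // ordinals follow the sorted term order
                    termDocs.seek(t);
                    while (termDocs.next()) {
                        // relies on each doc having exactly one term in this field
                        ordinals[termDocs.doc()] = ord;
                    }
                } while (terms.next());
            } finally {
                terms.close();
                termDocs.close();
            }
        }

        // A range "multi-filter" check is then one array lookup per doc.
        public boolean inRange(int doc, int lowOrd, int highOrd) {
            int o = ordinals[doc];
            return o >= lowOrd && o <= highOrd;
        }
    }

A set match works the same way, with a small bitset indexed by ordinal in place of the range test. (This also looks a lot like what FieldCache.getStringIndex() already builds for sorting, so that may be reusable.)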
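And for contrast, the TermDocs/OpenBitSet loop Paul is pointing at, roughly the shape of what RangeFilter does per matching term (again just my sketch, not the actual RangeFilter source):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.util.OpenBitSet;

    // Walk the postings for one term and set a bit per matching document.
    static OpenBitSet bitsForTerm(IndexReader reader, Term term) throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs(term);
        try {
            while (termDocs.next()) {
                bits.set(termDocs.doc());  // "include this doc"
            }
        } finally {
            termDocs.close();
        }
        return bits;
    }

This is exactly my cost problem: for an inclusive range you repeat that loop over every term in the range, touching 10% of the index to answer a 0.1% query. If the resulting filter is sparse enough, I believe it can then be wrapped as new SortedVIntList(bits) per Paul's suggestion.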