Filters are more efficient than query terms for many things. For example, a RangeFilter is usually more efficient than a RangeQuery and has no risk of triggering BooleanQuery.TooManyClauses. And Filter caching (e.g., with CachingWrapperFilter) can make otherwise expensive clauses almost free, after the first time.
But filters are not obvious. Many Lucene applications that would benefit from them do not. Wouldn't it be better if we could automatically spot Query clauses which are amenable to filter-conversion? Then applications would just get faster and throw fewer exceptions, without having to know anything about filters.
From a user level I think this might work as follows:
1. Query clauses which have a boost of 0.0 are candidates for filter conversion, since they cannot contribute to the score. We should perhaps make boost=0 the default for certain classes of query (e.g., perhaps RangeQuery) or make subclasses with this as the default (KeywordQuery).
2. One should be able to specify a filter cache size per IndexSearcher, with the notion that each filter cached uses one bit per document.
I'm not yet clear how this should be implemented. It might be based on something like:
public interface DocIdCollector { void collectDocId(int docId); }
/** Collects all DocIds that match the query. DocIds are collected in no particular order and may be collected more than once. Returns true if this feature is supported, false otherwise. */ public boolean Query.getFilterBits(IndexReader, DocIdCollector);
Implementing this for various query classes is straight-forward. TermQuery might return null for any but very common terms (occurring in, e.g., greater than 10% of documents). RangeQuery would use the logic that's currently in RangeFilter. Etc.
BooleanScorer could then use this method to create a filter bit-vector for all of the boost=0.0 clauses, then use that to filter the other boost!=0 clauses. The bit vectors could be cached in the scorer (using a LinkedHashMap), although I'm a little fuzzy on exactly how the cache API would work.
I'm not convinced the above is the best design, but I am convinced Lucene needs a solution for this. It could automatically eliminate most causes of BooleanQuery.TooManyClauses (e.g., from date ranges), and also make many required keyword clauses (document type, language, etc.) much faster.
What do others think? Does anyone have a better design or improvements to what I describe?
Doug
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]