Filters are more efficient than query terms for many things. For
example, a RangeFilter is usually more efficient than a RangeQuery and
has no risk of triggering BooleanQuery.TooManyClauses. And Filter
caching (e.g., with CachingWrapperFilter) can make otherwise expensive
clauses almost free, after the first time.

But filters are not obvious. Many Lucene applications that would
benefit from them do not use them. Wouldn't it be better if we could
automatically spot Query clauses which are amenable to
filter-conversion? Then applications would just get faster and throw
fewer exceptions, without having to know anything about filters.

From a user level I think this might work as follows:

1. Query clauses which have a boost of 0.0 are candidates for filter
conversion, since they cannot contribute to the score.  We should
perhaps make boost=0 the default for certain classes of query (e.g.,
perhaps RangeQuery) or make subclasses with this as the default
(KeywordQuery).

2. One should be able to specify a filter cache size per IndexSearcher,
with the notion that each filter cached uses one bit per document.
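
To make the one-bit-per-document notion concrete, here is a rough sizing
sketch. The class and method names are hypothetical, not existing Lucene
API, and it assumes a cached filter costs nothing beyond its bit vector:

```java
/** Hypothetical helper: sizing a filter cache at one bit per document. */
class FilterCacheSizing {

  /** Bytes needed for one cached filter over maxDoc documents (rounded up). */
  static long bytesPerFilter(int maxDoc) {
    // At least one byte, so an empty index cannot cause division by zero below.
    return Math.max(1L, ((long) maxDoc + 7) / 8);
  }

  /** Number of whole filters that fit in the given byte budget. */
  static long filtersThatFit(long budgetBytes, int maxDoc) {
    return budgetBytes / bytesPerFilter(maxDoc);
  }
}
```

For a 10-million-document index this gives 1,250,000 bytes per cached
filter, so a 64 MB budget would hold 53 filters.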

I'm not yet clear how this should be implemented.  It might be based on
something like:

  public interface DocIdCollector {
    void collectDocId(int docId);
  }

  /** Collects all DocIds that match the query.  DocIds are collected
      in no particular order and may be collected more than once.
      Returns true if this feature is supported, false otherwise. */
  public boolean Query.getFilterBits(IndexReader, DocIdCollector);

Implementing this for various query classes is straightforward.
TermQuery might return false for any but very common terms (e.g., those
occurring in more than 10% of documents).  RangeQuery would use the
logic that's currently in RangeFilter.  Etc.
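
On the receiving side, a collector could simply set bits. This is a
minimal sketch, assuming java.util.BitSet as the bit vector; the
BitSetCollector class is hypothetical and the interface is repeated only
so the example compiles on its own:

```java
import java.util.BitSet;

/** The collector interface proposed above, repeated for self-containment. */
interface DocIdCollector {
  void collectDocId(int docId);
}

/** Hypothetical collector that records matching doc ids in a bit vector. */
class BitSetCollector implements DocIdCollector {
  private final BitSet bits;

  BitSetCollector(int maxDoc) {
    bits = new BitSet(maxDoc);
  }

  public void collectDocId(int docId) {
    // Setting a bit is idempotent, so ids may arrive unordered or repeated.
    bits.set(docId);
  }

  BitSet getBits() {
    return bits;
  }
}
```

Because setting a bit twice is harmless, this fits the contract that
DocIds may be collected in any order and more than once.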

BooleanScorer could then use this method to create a filter bit-vector
for all of the boost=0.0 clauses, then use that to filter the other
boost!=0 clauses.  The bit vectors could be cached in the scorer (using
a LinkedHashMap), although I'm a little fuzzy on exactly how the cache
API would work.
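
One possible shape for the cache, as a sketch only: a LinkedHashMap in
access order that evicts the least recently used bit vector once a fixed
entry count is exceeded. The FilterBitsCache name and the choice of key
type are assumptions; how clauses would be keyed is the open question.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache of filter bit vectors, keyed by query clause. */
class FilterBitsCache<K> extends LinkedHashMap<K, BitSet> {
  private final int maxEntries;

  FilterBitsCache(int maxEntries) {
    // accessOrder=true: entries are ordered from least to most recently used.
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, BitSet> eldest) {
    // Called after each insertion; evict the eldest entry once over capacity.
    return size() > maxEntries;
  }
}
```

Note that LinkedHashMap is not synchronized, so a cache shared between
searching threads would need external locking.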

I'm not convinced the above is the best design, but I am convinced
Lucene needs a solution for this.  It could automatically eliminate most
causes of BooleanQuery.TooManyClauses (e.g., from date ranges), and also
make many required keyword clauses (document type, language, etc.) much
faster.

What do others think?  Does anyone have a better design or improvements
to what I describe?

Doug
