Filters are more efficient than query terms for many
I think there are two reasons for the performance gain: having things in RAM (e.g. the bits of a filter after it is computed once), and being able to search per field instead of per document.
Also, bit-vectors are constant-time to access.
As a first impression I don't like using a boost value for this purpose.
This would probably introduce problems for negative weights and negative scores, even though these are currently not used.
I'd rather keep the boosts and score values continuous and without limits.
I've never been convinced that negative weights are useful. Do you think that they are?
Perhaps a better way to specify that some parts of a query have yes/no behaviour would be by designating a set of fields as 'pure boolean' or 'filtering', and pass this set to a query parser. Compared to the standard query parser, a query parser like that would only need to override some get...Query() methods on the basis of this set of fields. Typical 'filtering' fields are dates and primary keys.
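To make the proposal concrete, here is a minimal sketch of the routing idea only. It does not use Lucene's QueryParser; the names Clause and FieldRoutingParser are hypothetical, and a real parser would override the relevant get...Query() methods rather than collect clauses into lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical field/text pair standing in for a parsed query clause. */
final class Clause {
    final String field;
    final String text;

    Clause(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

/** Hypothetical sketch: routes clauses by membership in a set of 'filtering' fields. */
final class FieldRoutingParser {
    private final Set<String> filteringFields;
    final List<Clause> scoring = new ArrayList<>();   // clauses that contribute to the score
    final List<Clause> filtering = new ArrayList<>(); // yes/no clauses, e.g. dates, primary keys

    FieldRoutingParser(Set<String> filteringFields) {
        this.filteringFields = filteringFields;
    }

    /** Adds a clause to the filtering list if its field was designated, else to scoring. */
    void add(String field, String text) {
        Clause clause = new Clause(field, text);
        if (filteringFields.contains(field)) {
            filtering.add(clause);
        } else {
            scoring.add(clause);
        }
    }
}
```

The set is passed in per parser instance, so nothing has to be declared on the index itself.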
As a design principle, Lucene has tried to avoid forcing folks to declare much about their documents and fields ahead of time. Indexes with different fields indexed differently may be freely intermixed. Perhaps this is not worth preserving, but neither should we give it up lightly.
In some cases it is possible to have better memory efficiency than one bit per document; see the compact sparse filter utilities I posted yesterday (http://issues.apache.org/bugzilla/show_bug.cgi?id=32921). I think this is most useful for reducing the filter cache size after various passes of collecting document IDs on one or more BitSets.
This is great stuff! Perhaps we should have a wrapper implementation that uses this representation when the bit density is less than 1/8, and converts to a bit vector when the bit density is greater than 1/8?
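A rough sketch of such a wrapper, under stated assumptions: the DocIdSet interface and all class names here are hypothetical, and a plain sorted int array stands in for the compact sparse representation from the Bugzilla issue, which is not reproduced here. The 1/8 threshold is taken from the suggestion above; with 32-bit ints the real break-even point for this stand-in would be lower.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical minimal view of a set of document ids. */
interface DocIdSet {
    boolean contains(int docId);
}

/** Sparse stand-in: ascending array of doc ids, looked up by binary search. */
final class SortedIntDocIdSet implements DocIdSet {
    private final int[] docIds;

    SortedIntDocIdSet(BitSet bits) {
        docIds = new int[bits.cardinality()];
        int i = 0;
        // nextSetBit walks the set bits in ascending order, so the array is sorted.
        for (int d = bits.nextSetBit(0); d >= 0; d = bits.nextSetBit(d + 1)) {
            docIds[i++] = d;
        }
    }

    public boolean contains(int docId) {
        return Arrays.binarySearch(docIds, docId) >= 0;
    }
}

/** Dense case: one bit per document, constant-time lookup. */
final class BitVectorDocIdSet implements DocIdSet {
    private final BitSet bits;

    BitVectorDocIdSet(BitSet bits) {
        this.bits = (BitSet) bits.clone(); // defensive copy
    }

    public boolean contains(int docId) {
        return bits.get(docId);
    }
}

final class AdaptiveDocIdSets {
    private AdaptiveDocIdSets() {}

    /** Chooses the sparse form when fewer than 1/8 of maxDoc bits are set. */
    static DocIdSet of(BitSet bits, int maxDoc) {
        boolean sparse = (long) bits.cardinality() * 8 < maxDoc;
        return sparse ? new SortedIntDocIdSet(bits) : new BitVectorDocIdSet(bits);
    }
}
```

The choice is made once, when the collected BitSet is handed over for caching, so lookups afterwards do not pay for the density check.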
I fully agree. BooleanScorer should first do all 'pure boolean'/'filtering' work and then determine the scores of the passing documents.
A possible design refinement:
The 'pure boolean' queries could provide a PureBooleanScorer
(subclass of Scorer) that throws an UnsupportedOperationException for
score(). These could then implement the Query.getFilterBits() operation above.
This API violates the "don't use exceptions for control flow" rule... Is your goal to get an efficient skipTo() for pure boolean queries?
Doug