On Tuesday 10 May 2005 22:39, Doug Cutting wrote: > Background: In http://issues.apache.org/bugzilla/show_bug.cgi?id=34673, > Yonik Seely proposes a ConstantScoreQuery, based on a Filter. And in > http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg08007.html > I proposed a mechanism to promote the use of Filters. Through all of > this, Paul Elshot has hinted that there might be a better way.
Well, I did not think of combining constant scores with evaluating clauses one by one, but it makes for a nice fit in mixing pure boolean as constant scoring and normally scored queries. See below for some more comments. > Here's another proposal, tackling many of the same issues: > > 1. Add two methods to Query.java: > > public boolean constantScoring(); > public void constantScoring(boolean); > > When constantScoring(), the boost() is the score for matches. Since many types of queries can become constant scoring, and attribute like this is better than a separate query class for constant scoring queries. > 2. Add two methods to Searcher.java: > > public BitSet cachedBitSet(Query) { return null; } > public void cacheBitSet(Query, BitSet) {} > > IndexSearcher overrides these to maintain an LRU cache of bitsets. A cache of filters might end up having different implementations of filters to save memory, see also below. > 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses > is not thrown. TooManyClauses in BooleanQuery works nicely, but I prefer not to have a direct association between BooleanQuery and TooManyClauses. Instead I'd rather have a factory for the term scorers that use buffer space (TermScorer and SpanTermScorer) and have this factory throw TooManyClauses. The factory would be passed to the rewrite() method. This is the approach taken in the Surround query language, which has two different factories, but they could be easily combined. > 4. Modify BooleanScorer to, if constantScoring(), > - check Searcher for a cached bitset > - failing that, create a bitset > - evaluate clauses serially, saving results in bitset That would only be possible for optional clauses, ie. the ones passed to a DisjunctionScorer in the trunk. For this case, the DisjunctionScorer would need to be extended. Btw. this would be even faster than any available boolean scorer since there is no need to compare doc id's in any way during scoring. > - cache the bitset Or a compact version of it as in http://issues.apache.org/bugzilla/show_bug.cgi?id=32921 ? > - use the bitset to handle doc(), next() and skipTo(); and implement boolean AND, OR and NOT when combining filters. > > 5. TermQuery and PhraseQuery could be similarly modified, so that, when > constant scoring, bitsets are cached for very common terms (e.g., >5% of > documents). > > With these changes, WildcardQuery, PrefixQuery, RangeQuery etc., when > declared to be constant scoring, will operate much faster and never > throw TooManyClauses. We can add an option (the default?) to > QueryParser to make these constant scoring. > > Also, instead of BitSet we could use an interface: > > public interface DocIdSet { > void add(int docId); > boolean contains(int docId); > int next(int docId); > } > > to permit sparse representations. The compact sparse representation referred to above stores the differences between doc id's as VInt's, so it can only provide an iterator over document numbers and not a random access as in contains(int) above. It needs only be used when it uses less memory than a bitset, which is the case when it allows rougly less than 1/8 of the docs of the reader. > Thoughts? Yes, thanks for combining the constant score with the one by one approach :) Regards, Paul Elschot. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]