Re: auto-filters?

Paul Elschot Thu, 06 Jan 2005 15:26:24 -0800

On Thursday 06 January 2005 22:31, Doug Cutting wrote:
> Paul Elschot wrote:
> >>Filters are more efficient than query terms for many 
> > 
> > I think there are two reasons for the performance gain:
> > - having things in RAM, e.g. the bits of a filter after it is computed once,
> > - being able to search per field instead of per document.
> 
> Also, bit-vectors are constant-time to access.


They may get swapped out for being too large and/or too numerous.
That is one case I'd like to avoid with the compact sparse filter.

> > As a first impression I don't like using a boost value for this purpose.
> > This would probably introduce problems for negative weights 
> > and negative scores, even though these are currently not used.
> > I'd rather keep the boosts and score values continuous and 
> > without limits.
> 
> I've never been convinced that negative weights are useful.  Do you 
> think that they are?

For general queries, negative weights are indeed not useful.
I have only used negative weights in two-way classifiers.

Also, query weights and score values are different things.

Using a particular score value (0.0f) to carry the meaning of boolean false
is confusing for a Scorer that already has a way to determine whether or not a
document is included in its results (doc() in particular). See also the patch
to FilteredQuery, which replaces the case of a resulting 0.0f score
with the use of the filter bits for next() and skipTo().
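
A minimal Java sketch of that idea, for illustration only. It is not the
FilteredQuery patch itself: MatchIterator, ArrayMatches and FilteredMatches
are hypothetical names, and only the method names next(), doc(), skipTo()
and score() are taken from this thread. The filter bits decide which
documents are visited at all, so no score value has to stand in for "false".

```java
import java.util.BitSet;

/** Hypothetical iterator shape, using the method names from the thread. */
interface MatchIterator {
    boolean next();             // advance to the next match; false when exhausted
    int doc();                  // current document, valid after a true next()/skipTo()
    boolean skipTo(int target); // advance to the first later match with doc() >= target
    float score();
}

/** Toy matches over a sorted array of document numbers, with a constant score. */
final class ArrayMatches implements MatchIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayMatches(int... sortedDocs) { this.docs = sortedDocs; }

    public boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    public int doc() { return docs[pos]; }

    public boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (docs[pos] < target);
        return true;
    }

    public float score() { return 1.0f; }
}

/** Passes through only the matches whose bit is set in the filter. */
final class FilteredMatches implements MatchIterator {
    private final MatchIterator in;
    private final BitSet filterBits;

    FilteredMatches(MatchIterator in, BitSet filterBits) {
        this.in = in;
        this.filterBits = filterBits;
    }

    public boolean next() { return in.next() && settle(); }

    public boolean skipTo(int target) { return in.skipTo(target) && settle(); }

    public int doc() { return in.doc(); }

    public float score() { return in.score(); }

    /** Moves the wrapped iterator forward until it sits on an allowed document. */
    private boolean settle() {
        while (true) {
            int allowed = filterBits.nextSetBit(in.doc());
            if (allowed < 0) return false;         // no allowed documents remain
            if (allowed == in.doc()) return true;  // current match passes the filter
            if (!in.skipTo(allowed)) return false; // jump over the filtered-out gap
        }
    }
}
```

In this sketch the filter also drives skipTo() on the wrapped iterator, so
runs of excluded documents are skipped rather than scored and discarded.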
 
> > Perhaps a better way to specify that some parts of a query have yes/no
> > behaviour would be by designating a set of fields as 'pure boolean' or
> > 'filtering', and pass this set to a query parser.
> > Compared to the standard query parser, a query parser like that would
> > only need to override some get...Query() methods on the basis
> > of this set of fields.
> > Typical 'filtering' fields are dates and primary keys.
> 
> As a design principle, Lucene has tried to avoid forcing folks to 
> declare much about their documents and fields ahead of time.  Indexes 
> with different fields indexed differently may be freely intermixed. 
> Perhaps this is not worth preserving, but neither should we give it up 
> lightly.

The information on how to use the fields is not needed before query time.
Still a query weight of 0 might be useful for exceptions to such predetermined
field usage in queries.
Also, this is about query weights, so this is not the same point as the one
above for score values.

I don't think this violates the free mixing of fields.
It would invite folks to think ahead about how to use fields in their
queries, which is actually ok.
 
> > In some cases it is possible to have better memory efficiency than one
> > bit per document, see the compact sparse filter utilities I posted yesterday
> > http://issues.apache.org/bugzilla/show_bug.cgi?id=32921
> > I think this is most useful for reducing the filter cache size after various
> > passes of collecting document id's on one or more BitSets.
> 
> This is great stuff!  Perhaps we should have a wrapper implementation 
> that, when the bit-density is less than 1/8, uses this representation, 
> and when the bit-density is greater than 1/8 converts to a bit vector?

It's convenient to start each Filter as a BitSet, and a good
moment to consider conversion to a more compact representation is when
entering the cache associated with the IndexReader/IndexSearcher.
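
A minimal Java sketch of converting at cache-entry time, for illustration
only. DocIdSets, DenseSet, SparseSet and compact() are hypothetical names,
and this is not the code attached to bug 32921. The sparse form here is a
plain sorted int array, which costs 32 bits per set document and so only
beats a bit vector below a density of 1/32; a 1/8 threshold would fit an
encoding of roughly 8 bits per document. The cost is therefore a parameter.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical helper, not part of Lucene. */
final class DocIdSets {
    /** Read-only view of a set of document numbers. */
    interface DocIdSet {
        boolean contains(int doc);
        int size();
    }

    /** Dense form: one bit per document, constant-time lookup. */
    static final class DenseSet implements DocIdSet {
        private final BitSet bits;

        DenseSet(BitSet bits) { this.bits = (BitSet) bits.clone(); }

        public boolean contains(int doc) { return doc >= 0 && bits.get(doc); }

        public int size() { return bits.cardinality(); }
    }

    /** Sparse form: sorted int array, binary-search lookup. */
    static final class SparseSet implements DocIdSet {
        private final int[] docs;

        SparseSet(BitSet bits) {
            docs = new int[bits.cardinality()];
            int i = 0;
            for (int d = bits.nextSetBit(0); d >= 0; d = bits.nextSetBit(d + 1)) {
                docs[i++] = d;
            }
        }

        public boolean contains(int doc) { return Arrays.binarySearch(docs, doc) >= 0; }

        public int size() { return docs.length; }
    }

    /**
     * Picks the smaller representation for a filter over maxDoc documents.
     * bitsPerSparseDoc is the assumed cost of one entry in the sparse form
     * (32 for the int array above).
     */
    static DocIdSet compact(BitSet bits, int maxDoc, int bitsPerSparseDoc) {
        long sparseCost = (long) bits.cardinality() * bitsPerSparseDoc;
        return sparseCost < maxDoc ? new SparseSet(bits) : new DenseSet(bits);
    }
}
```

A cache could call compact() once when a filter is stored, so the BitSet
used while collecting document numbers never has to stay resident.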

> > I fully agree. BooleanScorer should first try and do all 'pure boolean'/ 
> > 'filtering' work and then continue to determine the scores of the passing 
> > documents.
> > A possible design refinement:
> > The 'pure boolean' queries could provide a PureBooleanScorer
> > (subclass of Scorer) that throws an UnsupportedOperationException for
> > score(). These could then implement the Query.getFilterBits() operation above.
> 
> This API violates the "don't use exceptions for control flow" rule... 

After some more consideration I think that this PureBooleanScorer
should actually be a superclass of Scorer without a score() method.
The Scorer subclass could add the score() method.
At the moment I can't tell whether a superclass like that would actually
be useful.
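
A rough Java sketch of such a hierarchy, for illustration only. The class
names DocIterator, ScoringIterator, ArrayDocIterator and
ConstantScoreIterator are hypothetical and do not exist in Lucene; only
the method names come from this thread. Because the iteration-only type
has no score() method, no exception is needed to signal its absence.

```java
/** Iteration only: what a 'pure boolean' query would need to provide. */
abstract class DocIterator {
    abstract boolean next();             // advance; false when exhausted
    abstract int doc();                  // current document number
    abstract boolean skipTo(int target); // advance to the first later doc >= target
}

/** Adds scoring on top of iteration. */
abstract class ScoringIterator extends DocIterator {
    abstract float score();
}

/** Pure boolean example: iterates a sorted array of document numbers. */
final class ArrayDocIterator extends DocIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIterator(int[] sortedDocs) { this.docs = sortedDocs; }

    boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    int doc() { return docs[pos]; }

    boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (docs[pos] < target);
        return true;
    }
}

/** Scoring example: gives every document of a pure boolean iterator one score. */
final class ConstantScoreIterator extends ScoringIterator {
    private final DocIterator in;
    private final float value;

    ConstantScoreIterator(DocIterator in, float value) {
        this.in = in;
        this.value = value;
    }

    boolean next() { return in.next(); }

    int doc() { return in.doc(); }

    boolean skipTo(int target) { return in.skipTo(target); }

    float score() { return value; }
}
```

Code that only filters can accept a DocIterator, while code that ranks
asks for a ScoringIterator, and the compiler enforces the difference.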

> Is your goal to get an efficient skipTo() for pure boolean queries?

It's more the other way around: skipTo() provides efficiency in larger
indexes, so I'd like to have a good implementation available in all scorers,
pure boolean and mixed ones alike.

Regards,
Paul Elschot

