Re: constant scoring queries

Paul Elschot Sun, 15 May 2005 04:04:21 -0700

On Wednesday 11 May 2005 04:42, Yonik Seeley wrote:
> Hey now... you're going to obsolete all my in-house code and put me
> out of a job ;-)


We all like cherry picking.

> Could you elaborate on the advantage of having say a TermQuery that
> could be either normal-scoring or constant-scoring vs two different
> Query classes for doing this?  They seem roughly equivalent.
> 
> 
> > 1. Add two methods to Query.java:
> > 
> >    public boolean constantScoring();
> >    public void constantScoring(boolean);
> > 
> >    When constantScoring(), the boost() is the score for matches.
> 
> That seems fine.

.. with the is.. prefix as suggested by Mark Schreiber.
 
> > 2. Add two methods to Searcher.java:
> > 
> >    public BitSet cachedBitSet(Query) { return null; }
> >    public void cacheBitSet(Query, BitSet) {}
> > 
> >    IndexSearcher overrides these to maintain an LRU cache of bitsets.
> 
> Yup, that's what I have.
> Things should be extensible and use a caching interface - the default
> implementation being an LRU cache, but users could use their own
> implementations to get LFU behavior or whatever.

To have the actual implementation of java.util.BitSet
in the interface is not really nice. The FilteredQuery here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
has two constructors, one for a Filter that provides a BitSet,
and one for SkipFilter that provides a DocNrSkipper, which
is an iterator that can skip (see below).
Making a DocNrSkipper out of a BitSet is no problem,
the underlying compact SortedVIntList has a constructor from a BitSet
(see the useSkipFilter method of this FilteredQuery).
  
> > 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> > is not thrown.
 
> This is good, but not sufficient for RangeQuery.  If
> RangeQuery.constantScoring(), then it should not rewrite to a
> BooleanQuery at all.  Depending on the RangeQuery, just the creation
> of a BooleanQuery that matches it is too heavyweight.

By itself, a BooleanQuery is no more than a collection of clauses.
For the case of a RangeQuery (and others with optional clauses),
the heavy weight can be avoided by iterating over the clauses one by
one. I think the idea is to have this iteration over clauses implemented in
one place, in BooleanQuery.
That means that a constant scoring query might also provide an iterator
over its clauses, ie. the ones that are to be OR'ed, ie subject to
disjunction.
  
> > Also, instead of BitSet we could use an interface:
> > 
> >    public interface DocIdSet {
> >      void add(int docId);
> >      boolean contains(int docId);
> >      int next(int docId);
> >    }
> > 
> > to permit sparse representations.
> 
> Definitely a DocIdSet.  It's called DocSet in my code and has a bitset
> implementation and a compact implementation that's an int hash set
> (unordered cause I just use it as a filter now).  Here is the basic
> interface:
> 
> public interface DocSet {
>   public int size();
>   public boolean exists(int docid);
>   public DocIterator iterator();
>   public BitSet getBits();
>   public long memSize();
>   public DocSet intersection(DocSet other);
>   public int intersectionSize(DocSet other);
>   public DocSet union(DocSet other);
>   public int unionSize(DocSet other);
> }

contains(docid) and exists(docid) cannot be efficiently implemented
on a VInt based compact filter, so I'd prefer to leave these out.
BitSet getBits() is also problematic, as it would need to create a BitSet.

The rest of this interface looks ok to me, and a negative intersection
for AND NOT may have to be added for excluded clauses.
 
> I would separate out int next(int docId) into an iterator.  It may be
> more efficient to iterate over certain structures if you can maintain
> state about where you are (and this may even be true of a BitSet).

Is the int next(int docId) is the same as the DocNrSkipper here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32921
ie. skipping over document number returned by the DocIterator above?
In DocNrSkipper the skipping can be suppressed by using currentDoc + 1
as argument, so these two might be combined into a single one, but
I don't know whether a hash based implementation would allow such
ordered access.

When I wrote the SkipFilter, I considered a hash table based implementation,
but I did not implement it because I thought it would use too much memory.
However, I did not check the memory usage of a java implementation of
a hashed document id set. Would you have some figures for memSize()
results of the DocSet above?

Anyway, the things above are converging nicely.
There are some further points for an implementation of cached query results:
- should the use of BitSet in public interfaces be deprecated in favour of
  a (skipping) document id iterator?
- the boolean (and,or,not) logic over the document id sets.
- the cache in IndexSearcher.

The boolean logic over doc id sets could be done in the corresponding 
scorers (ConjunctionScorer for and, DisjunctionScorer for or, and
ReqExclScorer for not). Like Query, Scorer may have to be extended
with an isConstantScoring() method for this. The implementation of
DisjunctionScorer will probably need to be changed to use the iterator
over the clauses.

For the cache in IndexSearcher a LinkedHashMap seems good.
It can evict the cached least recently used filter(s) when the total memory
used by the cached filters exceeds a maximum.

This could also provide for changes in the underlying reader.
A callback interface (as mentioned by Robert Engels) similar to
updateQuery(Term t,BitSet newdocs);
could be used in for added documents, but when docs are deleted
this would not work well on the level of a reader. It might be
better to do this on Segments because these are immutable.
Associating a Query with a Segment is a far fetch at the moment,
though.


Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: constant scoring queries

Reply via email to