Re: constant scoring queries

Paul Elschot Wed, 18 May 2005 00:37:56 -0700

On Tuesday 17 May 2005 19:35, Yonik Seeley wrote:
> > To have the actual implementation of java.util.BitSet
> > in the interface is not really nice.
> 
> Totally agree.
> 
> > The FilteredQuery here:
> > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> > has two constructors, one for a Filter that provides a BitSet,
> > and one for SkipFilter that provides a DocNrSkipper, which
> > is an iterator that can skip (see below).
> > Making a DocNrSkipper out of a BitSet is no problem,
> > the underlying compact SortedVIntList has a constructor from a BitSet
> > (see the useSkipFilter method of this FilteredQuery).
> > 
> > > > 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> > > > is not thrown.
> > 
> > > This is good, but not sufficient for RangeQuery.  If
> > > RangeQuery.constantScoring(), then it should not rewrite to a
> > > BooleanQuery at all.  Depending on the RangeQuery, just the creation
> > > of a BooleanQuery that matches it is too heavyweight.
> > 
> > By itself, a BooleanQuery is no more than a collection of clauses.
> > For the case of a RangeQuery (and others with optional clauses),
> > the heavy weight can be avoided by iterating over the clauses one by
> > one. I think the idea is to have this iteration over clauses implemented 
in
> > one place, in BooleanQuery.
> 
> But for certain range queries you want to avoid even materializing
> that collection of clauses (currently done in RangeQuery.rewrite())
> since it can number in the millions.
> 
> > That means that a constant scoring query might also provide an iterator
> > over its clauses, ie. the ones that are to be OR'ed, ie subject to
> > disjunction.
> 
> Ahhh, I see...


I realize just now that this also means that a simple flag on a Query
will not be enough.

> > > > Also, instead of BitSet we could use an interface:
> > > >
> > > >    public interface DocIdSet {
> > > >      void add(int docId);
> > > >      boolean contains(int docId);
> > > >      int next(int docId);
> > > >    }
> > > >
> > > > to permit sparse representations.
> > >
> > > Definitely a DocIdSet.  It's called DocSet in my code and has a bitset
> > > implementation and a compact implementation that's an int hash set
> > > (unordered cause I just use it as a filter now).  Here is the basic
> > > interface:
> > >
> > > public interface DocSet {
> > >   public int size();
> > >   public boolean exists(int docid);
> > >   public DocIterator iterator();
> > >   public BitSet getBits();
> > >   public long memSize();
> > >   public DocSet intersection(DocSet other);
> > >   public int intersectionSize(DocSet other);
> > >   public DocSet union(DocSet other);
> > >   public int unionSize(DocSet other);
> > > }
> > 
> > contains(docid) and exists(docid) cannot be efficiently implemented
> > on a VInt based compact filter, so I'd prefer to leave these out.
> 
> exists() on a BitSet is much faster than next() though... 

Yes, but the point is to iterate to the next document based in information
from RAM and to be able to skipTo() on the index instead of reading it 
sequentially.
 
> > BitSet getBits() is also problematic, as it would need to create a BitSet.
> 
> Agree.  Any method like that should be specific to a subclass of
> DocIdSet (the one that implements it as a BitSet).
> 
> > When I wrote the SkipFilter, I considered a hash table based 
implementation,
> > but I did not implement it because I thought it would use too much memory.
> > However, I did not check the memory usage of a java implementation of
> > a hashed document id set. Would you have some figures for memSize()
> > results of the DocSet above?
> 
> I use a power-of-two hash table with a load factor of .75.  So putting
> 500 docs in my hashset would take up 1024 slots at 4 bytes per slot
> (4k).

So about 8 bytes per doc. A SortedVIntList normally has 1 byte per doc,
and never more than 4 bytes per doc (as long as doc numbers are int).
 
> Hashes also have big speed advantage over BitSets for iterating over
> all docs or taking intersection sizes.  The hash also has fast random
> access that docNrSkipper doesn't have.

Can it take determine the intersection size faster than iterating over
both sets with an intersecting merge?
 
> If lucene goes to iterating over filters in order  (via docNrSkipper &
> your new FilteredQuery), a hash implementation no longer makes sense.

The new BooleanScorer is based on iterating over documents in order.
There may be a performance advantage somewhere for
filters with hash implementations and I'd rather not miss that.
 
> > Anyway, the things above are converging nicely.
> > There are some further points for an implementation of cached query 
results:
> > - should the use of BitSet in public interfaces be deprecated in favour of
> >   a (skipping) document id iterator?
> 
> BitSet as the only way to access a filter should be depracated. 
> However, *If* a filter implementation internally uses a BitSet, it
> might be nice to be able to get access to it.

An instanceof and a cast should be able to provide that.

> > - the boolean (and,or,not) logic over the document id sets.
> 
> Absolutely.  And functions that provide counts too (can be more
> efficient than functions that have to materialize the resulting set).
> 
> > - the cache in IndexSearcher.
> Yes!  But user extensible please...

I'd prefer to start with a simple implementation, deferring subclassing or
implementing an interface for later.
 
> > The boolean logic over doc id sets could be done in the corresponding
> > scorers (ConjunctionScorer for and, DisjunctionScorer for or, and
> > ReqExclScorer for not). Like Query, Scorer may have to be extended
> > with an isConstantScoring() method for this. The implementation of
> > DisjunctionScorer will probably need to be changed to use the iterator
> > over the clauses.

In case the filters are wrapped in a Scorer all the boolean logic is already 
implemented in these scorers. DisjunctionScorer should probably target a
BitSet when constant scoring.
 
> > For the cache in IndexSearcher a LinkedHashMap seems good.
> > It can evict the cached least recently used filter(s) when the total 
memory
> > used by the cached filters exceeds a maximum.
> 
> That's what I use.
> 
> Will setConstantScore(true) on a query delegate that call to all
> subqueries?  It would seem very useful to be able to easily switch an

As a minimum, setConstantScore would ensure that Scorer.score() 
will not be called for the scorers of the subqueries.

> entire query to constant scoring (for instance when a non-score sort
> is specified and the user doesn't care about the score).

Sounds perfect to me, but I've never looked at the sorting code in depth.
Do the non-score sort methods use HitCollector?
That might be wasteful because it provides the score value for each doc.

As for implementing all of this constant scoring and filter caching,
I would like to make sure that the new BooleanScorer works close
to perfection before continuing.
In particular there are some patches for which I'd like to know
whether they are acceptable:
a patch for DisjunctionSumScorer:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34193
and a split off of the coordination:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34154

In fact, since this builds on the new scorer, I'd prefer to have that
used widely before continuing.
Also, I have no idea about the order in which order this constantScoring
and filter caching could be easily implemented. Any ideas?

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: constant scoring queries

Reply via email to