On Wednesday 07 January 2009 07:36:06 John Wang wrote:
> Hi guys:
>
>      We have been building a suite of boolean operators DocIdSets
> (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator,
> NotDocIdSet/Iterator). We compared our implementation on the
> OrDocIdSetIterator (based on DisjunctionMaxScorer code) with some
> code tuning, and we see the performance doubled in our testing. 

That's good news.
What data structure did you use for sorting by doc id?
Currently a priority queue is used for that, and normally that is
the bottleneck for performance.
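
To make the priority queue approach concrete, here is a rough stand-alone
sketch. It merges plain sorted int arrays rather than real scorers, and the
names OrMergeSketch, Cursor and union are made up for this example; it is
not the DisjunctionSumScorer code. Every next() costs a poll and an add on
the queue, which is where the time tends to go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not taken from Lucene.
class OrMergeSketch {
    // A cursor over one sorted array of doc ids.
    static final class Cursor {
        final int[] docs;
        int pos = 0;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return docs[pos]; }
        boolean advance() { return ++pos < docs.length; }
    }

    // Returns the union of the sorted inputs in increasing doc id order,
    // with duplicates removed.
    static List<Integer> union(int[]... lists) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int[] docs : lists) {
            if (docs.length > 0) {
                queue.add(new Cursor(docs));
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            // Take the cursor with the smallest current doc id.
            Cursor top = queue.poll();
            int doc = top.doc();
            if (out.isEmpty() || out.get(out.size() - 1) != doc) {
                out.add(doc);
            }
            // Re-insert the cursor only if it has more docs.
            if (top.advance()) {
                queue.add(top);
            }
        }
        return out;
    }
}
```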

> (we
> haven't done comparisons with ConjunctionScorer vs.
> AndDocIdSetIterator, will post numbers when we do)
>
>      We'd be happy to contribute this back to the community. But what
> is the best way of going about it?
>
> option 1: merge our change into DisjunctionMax/SumScorers.
> option 2: contribute boolean operator sets, and have
> DisjunctionScorers derive from OrDocIdSetIterator, ConjunctionScorer
> derive from AndDocIdSetIterator etc.
>
>      Option 2 seems to be cleaner. Thoughts?

Some further performance improvement is possible when the
minimum number of required scorers/iterators is higher than 1,
by using skipTo() as much as possible instead of next() in
such cases. For the moment that is theoretical because there
is no working implementation of this yet, but have a look at
LUCENE-1345.
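
One way this could work, as an untested idea and not what LUCENE-1345
currently contains: with n iterators and a minimum of m matches, any
matching doc must occur in at least one of the first n - m + 1 iterators,
so only those need next(); the remaining m - 1 can be moved with skipTo()
to each candidate. The sketch below uses sorted int arrays, and the names
MinMatchSketch, Cursor and atLeast are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; assumes doc ids < Integer.MAX_VALUE.
class MinMatchSketch {
    static final int NO_MORE = Integer.MAX_VALUE;

    static final class Cursor {
        final int[] docs;
        int pos = 0;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return pos < docs.length ? docs[pos] : NO_MORE; }
        void next() { pos++; }
        // Moves to the first doc >= target and returns it.
        int skipTo(int target) {
            while (doc() < target) {
                pos++;
            }
            return doc();
        }
    }

    // Returns the docs that occur in at least minMatch of the sorted lists.
    static List<Integer> atLeast(int minMatch, int[]... lists) {
        int n = lists.length;
        if (minMatch < 1 || minMatch > n) {
            throw new IllegalArgumentException("minMatch out of range");
        }
        Cursor[] cursors = new Cursor[n];
        for (int i = 0; i < n; i++) {
            cursors[i] = new Cursor(lists[i]);
        }
        int headSize = n - minMatch + 1; // these drive the candidates
        List<Integer> out = new ArrayList<>();
        while (true) {
            // Candidate: the smallest current doc among the head cursors.
            int candidate = NO_MORE;
            for (int i = 0; i < headSize; i++) {
                candidate = Math.min(candidate, cursors[i].doc());
            }
            if (candidate == NO_MORE) {
                break;
            }
            int matches = 0;
            for (int i = 0; i < headSize; i++) {
                if (cursors[i].doc() == candidate) {
                    matches++;
                    cursors[i].next();
                }
            }
            // The tail cursors never call next(); they only skip.
            for (int i = headSize; i < n; i++) {
                if (cursors[i].skipTo(candidate) == candidate) {
                    matches++;
                }
            }
            if (matches >= minMatch) {
                out.add(candidate);
            }
        }
        return out;
    }
}
```

A linear scan over the head cursors keeps the sketch short; a priority
queue over just the head would be the obvious replacement for larger n.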

I'm currently working on a DisjunctionDISI, probably the same function 
as the OrDocIdSetIterator you mentioned above. In case you have
something faster than that, could you post it at LUCENE-1345 or at a
new issue?

An AndDocIdSetIterator could also be useful for the PhraseScorers and
for the SpanNear queries, but that is a concern for later.
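
For reference, the conjunction case is where skipTo() pays off most
directly: whichever side is behind skips to the other side's current doc.
A minimal sketch over two sorted arrays of distinct doc ids follows; the
names AndLeapfrogSketch, intersect and skipTo are made up here, and binary
search stands in for whatever skipping the real iterators provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; inputs must be strictly increasing.
class AndLeapfrogSketch {
    // Returns the doc ids present in both sorted arrays.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = skipTo(a, i, b[j]); // a is behind: skip it forward
            } else {
                j = skipTo(b, j, a[i]); // b is behind: skip it forward
            }
        }
        return out;
    }

    // Index of the first element >= target at or after position from.
    static int skipTo(int[] docs, int from, int target) {
        int idx = Arrays.binarySearch(docs, from, docs.length, target);
        return idx >= 0 ? idx : -idx - 1;
    }
}
```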

So I'd prefer option 2.

Regards,
Paul Elschot
