On Wednesday 07 January 2009 07:36:06 John Wang wrote:
> Hi guys:
>
> We have been building a suite of boolean operators DocIdSets
> (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator,
> NotDocIdSet/Iterator). We compared our implementation on the
> OrDocIdSetIterator (based on DisjunctionMaxScorer code) with some
> code tuning, and we see the performance doubled in our testing.

That's good news. What data structure did you use for sorting by
doc id? Currently a priority queue is used for that, and normally
that is the performance bottleneck.

> (we
> haven't done comparisons with ConjunctionScorer vs.
> AndDocIdSetIterator, will post numbers when we do)
>
> We'd be happy to contribute this back to the community. But what
> is the best way of going about it?
>
> option 1: merge our change into DisjunctionMax/SumScorers.
> option 2: contribute boolean operator sets, and have
> DisjunctionScorers derive from OrDocIdSetIterator, ConjunctionScorer
> derive from AndDocIdSetIterator etc.
>
> Option 2 seems to be cleaner. Thoughts?

Some theoretical performance improvement is possible when the
minimum number of required scorers/iterators is higher than 1, by
using skipTo() (as much as possible) instead of next() in such
cases. For the moment that's theoretical because there is no working
implementation of this yet, but have a look at LUCENE-1345.

I'm currently working on a DisjunctionDISI, which probably has the
same function as the OrDocIdSetIterator you mentioned above. In case
you have something faster than that, could you post it at
LUCENE-1345 or at a new issue?

An AndDocIdSetIterator could also be useful for the PhraseScorers
and for the SpanNear queries, but that is of later concern.

So I'd prefer option 2.

Regards,
Paul Elschot
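
For readers unfamiliar with the priority-queue approach mentioned above, here is a minimal sketch of a disjunction over doc id iterators with a minimum number of matching sub-iterators. It is not the Lucene code and not the implementation from either poster: SimpleDocIdIterator, ArrayDocIdIterator and OrDocIdIterator are hypothetical names made up for this illustration, and the sketch only uses next-style advancing, so it does not include the skipTo() optimization discussed above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a doc id iterator; not a Lucene class. */
interface SimpleDocIdIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Advances to the next doc id, or returns NO_MORE_DOCS when exhausted. */
    int nextDoc();

    /** Current doc id; only valid after nextDoc() has been called. */
    int docId();
}

/** Iterates over a sorted array of doc ids (assumed strictly increasing). */
final class ArrayDocIdIterator implements SimpleDocIdIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIdIterator(int... sortedDocs) {
        this.docs = sortedDocs;
    }

    public int nextDoc() {
        pos++;
        return docId();
    }

    public int docId() {
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

/** Disjunction: yields doc ids present in at least minMatch sub-iterators. */
final class OrDocIdIterator implements SimpleDocIdIterator {
    // Sub-iterators ordered by their current doc id; this queue is the
    // structure that tends to dominate the cost of a disjunction.
    private final PriorityQueue<SimpleDocIdIterator> queue =
        new PriorityQueue<>(Comparator.comparingInt(SimpleDocIdIterator::docId));
    private final int minMatch;
    private int current = -1;

    OrDocIdIterator(int minMatch, List<? extends SimpleDocIdIterator> subs) {
        if (minMatch < 1) {
            throw new IllegalArgumentException("minMatch must be >= 1");
        }
        this.minMatch = minMatch;
        for (SimpleDocIdIterator sub : subs) {
            if (sub.nextDoc() != NO_MORE_DOCS) {
                queue.add(sub); // only keep iterators that have a first doc
            }
        }
    }

    public int nextDoc() {
        // With fewer than minMatch live iterators, no further doc can match.
        while (queue.size() >= minMatch) {
            int candidate = queue.peek().docId();
            int matches = 0;
            // Pop and advance every iterator positioned on the candidate.
            while (!queue.isEmpty() && queue.peek().docId() == candidate) {
                SimpleDocIdIterator top = queue.poll();
                matches++;
                if (top.nextDoc() != NO_MORE_DOCS) {
                    queue.add(top);
                }
            }
            if (matches >= minMatch) {
                current = candidate;
                return current;
            }
        }
        current = NO_MORE_DOCS;
        return current;
    }

    public int docId() {
        return current;
    }
}

/** Small helper for trying the sketch out. */
final class DocIdSketch {
    static List<Integer> collect(SimpleDocIdIterator it) {
        List<Integer> out = new ArrayList<>();
        for (int d = it.nextDoc(); d != SimpleDocIdIterator.NO_MORE_DOCS; d = it.nextDoc()) {
            out.add(d);
        }
        return out;
    }
}
```

With minMatch set to 1 this behaves as a plain OR over the sub-iterators. With a higher minMatch, every candidate doc id still passes through the queue, which is where a skipTo()-based variant could in principle save work.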