Li, I can give you my understanding of difference in DisjunctionSumScorer and https://issues.apache.org/jira/browse/LUCENE-2686
When you choose steady children approach, even you omit call score() you have to enumerate top of child scorers heap twice. Check nextDoc() from the patch: first loop pushes top of heap first, then the second loop through the top is done by countMatches(1); countMatches(2); Current DisjunctionSumScorer enumerate top of heap once per every doc: while it loops through top of heap it counts number of clauses matched; accumulate score; and pushes top children one step forward. what's minimumNrMatchers do you have? can you upload your test github? (mail list rips attachments off) On Thu, Apr 19, 2012 at 7:34 AM, Li Li <[email protected]> wrote: > Michael McCandless wrote: > > So... the good news is I made a new scorer (basically copied > DisjunctionMaxScorer and then tweaked from there) that scores the OR-only > case. All tests pass w/ this new scorer. > > And more good news is that if you don't score (I sort by doctitle to do > that), you get a speedup – 7.7% in my simplistic test (prefix query unit*, > expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force > it to use BS2 not BS – the nocommits in the patch). > > But the bad news is with scoring on it's 22.7% slower! > > And, the weird news is, I discovered accidentally that BS2 is much (> 2X) > faster for this one query. I think we need to modify the criteria that > decides whether to use BS or BS2... maybe when there are lots of > lowish-docFreq terms, BS2 is better? > --------------- > why new algorithm is 22% slower than old one? > I read the codes and feel them almost the same. > > On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev < > [email protected]> wrote: > >> Hello, >> >> I can't help with the particular question, but can share some experience. >> My task is roughly the same I've found the patch >> https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful >> (with one small addition, I'll post it in comments soon). By using it I >> have disjunction summing query with steady subscorers. >> >> Regards >> >> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <[email protected]> wrote: >> >>> hi all, >>> I am now hacking the BooleanScorer2 to let it keep the docID() of >>> the leaf scorer(mostly possible TermScorer) the same as the top-level >>> Scorer. Why I want to do this is: When I Collect a doc, I want to know >>> which term is matched(especially for BooleanClause whose Occur is SHOULD). >>> we have discussed some solutions, such as adding bit masks in disjunction >>> scorers. with this method, when we finds a matched doc, we can recursively >>> find which leaf scorer is matched. But we think it's not very efficient and >>> not convenient to use(this is my proposal but not agreed by others in our >>> team). and then we came up with another one: Modifying DisjunctionSumScorer. >>> we analysed the codes and found that the only Scorers used by >>> BooleanScorer2 that will make the children scorers' docID() not equal to >>> parent is an anonymous class inherited from DisjunctionSumScorer. All other >>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous), >>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need. >>> The implementation algorithm of DisjunctionSumScorer use a heap to >>> find the smallest doc. after finding a matched doc, the currentDoc is the >>> matched doc and all the scorers in the heap will call nextDoc() so all of >>> the scorers' current docID the nextDoc of currentDoc. if there are N level >>> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId >>> of the root of the scorer tree. >>> So we modify the DisjuctionSumScorer and let it behavior as we >>> expected. And then I wrote some TestCase and it works well. And also I >>> wrote some random generated TermScorer and compared the nextDoc(),score() >>> and advance(int) method of original DisjunctionSumScorer and modified one. >>> nextDoc() and score() and exactly the same. But for advance(int target), we >>> found some interesting and strange things. >>> at the beginning, I think if target is less than current docID, it >>> will just return current docID and do nothing. this assumption let my >>> algorithm go wrong. Then I read the codes of TermScorer and found each call >>> of advance(int) of TermScorer will call nextDoc() no matter whether current >>> docID is larger than target or not. >>> So I am confused and then read the javadoc of DocIdSetIterator: >>> ----------------- javadoc of DocIdSetIterator.advance(int >>> target)------------- >>> >>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws >>> IOException >>> >>> Advances to the first beyond (see NOTE below) the current whose document >>> number is greater than or equal >>> to target. Returns the current document number or NO_MORE_DOCS if there >>> are no more docs in the set. >>> Behaves as if written: >>> int advance(int target) { >>> int doc; >>> while ((doc = nextDoc()) < target) { >>> } >>> return doc; >>> } >>> Some implementations are considerably more efficient than that. >>> NOTE: when target < current implementations may opt not to advance >>> beyond their current docID(). >>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some >>> Scorers. If your >>> implementation cannot efficiently determine that it should exhaust, it >>> is recommended that you check for >>> that value in each call to this method. >>> NOTE: after the iterator has exhausted you should not call this method, >>> as it may result in unpredicted >>> behavior. >>> -------------------------------------- >>> Then I modified my algorithm again and found that >>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of >>> the cases, it will return currentDoc if target < currentDoc. but in some >>> boundary condition, it will not. >>> it's not a bug but let me sad. I thought my algorithm has some bug >>> because it's advance method is not exactly the same as original >>> DisjunctionSumScorer's. >>> ----codes of DisjunctionSumScorer--- >>> @Override >>> public int advance(int target) throws IOException { >>> if (scorerDocQueue.size() < minimumNrMatchers) { >>> return currentDoc = NO_MORE_DOCS; >>> } >>> if (target <= currentDoc) { >>> return currentDoc; >>> } >>> .... >>> ------------------- >>> for most case if (target <= currentDoc) it will return currentDoc; >>> but if previous advance will make sub scorers exhausted, then if may >>> return NO_MORE_DOCS >>> an example is: >>> currentDoc=-1 >>> minimumNrMatchers=1 >>> subScorers: >>> TermScorer: docIds: [1, 2, 6] >>> TermScorer: docIds: [2, 4] >>> after first call advance(5) >>> currentDoc=6 >>> only first scorer is now in the heap, scorerDocQueue.size()==1 >>> then call advance(6) >>> because scorerDocQueue.size() < minimumNrMatchers, it just return >>> NO_MORE_DOCS >>> >>> My question is why the advance(int target) method is defined like this? >>> for the reason of efficient or any other reasons? >>> >>> >> >> >> >> -- >> Sincerely yours >> Mikhail Khludnev >> [email protected] >> >> <http://www.griddynamics.com> >> <[email protected]> >> >> > -- Sincerely yours Mikhail Khludnev [email protected] <http://www.griddynamics.com> <[email protected]>
