Some corrections to the example:
After the first call, advance(5):
currentDoc=6
The first scorer's nextDoc() is also called during that advance, so it is
exhausted and the heap is now empty (scorerDocQueue.size()==0).
Then call advance(6):
because scorerDocQueue.size() < minimumNrMatchers, it just returns
NO_MORE_DOCS
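
To double-check the corrected trace, here is a minimal sketch in Java. It is a simplified model of the heap-based logic, not the Lucene source: DisjunctionSketch, ListScorer and queueSize() are hypothetical names, ListScorer just walks a sorted int array in place of a TermScorer, and scoring is left out.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Simplified model of a heap-based disjunction; NOT the Lucene source. */
class DisjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a TermScorer over a sorted doc ID list. */
    static final class ListScorer {
        private final int[] docs;
        private int pos = -1;
        ListScorer(int... docs) { this.docs = docs; }
        int docID() {
            return pos < 0 ? -1 : pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
        int nextDoc() { pos++; return docID(); }
        int advance(int target) {
            int doc;
            while ((doc = nextDoc()) < target) { /* skip docs before target */ }
            return doc;
        }
    }

    // Min-heap ordered by each sub scorer's current doc ID.
    private final PriorityQueue<ListScorer> queue =
        new PriorityQueue<>(Comparator.comparingInt(ListScorer::docID));
    private final int minimumNrMatchers; // assumed to be >= 1
    private int currentDoc = -1;

    DisjunctionSketch(int minimumNrMatchers, ListScorer... subScorers) {
        this.minimumNrMatchers = minimumNrMatchers;
        for (ListScorer s : subScorers) {
            if (s.nextDoc() != NO_MORE_DOCS) queue.add(s); // position on first doc
        }
    }

    int queueSize() { return queue.size(); }

    int advance(int target) {
        // Same two checks, in the same order, as the snippet quoted in the mail.
        if (queue.size() < minimumNrMatchers) return currentDoc = NO_MORE_DOCS;
        if (target <= currentDoc) return currentDoc;
        while (true) {
            if (queue.peek().docID() >= target) {
                return advanceAfterCurrent() ? currentDoc : (currentDoc = NO_MORE_DOCS);
            }
            ListScorer top = queue.poll();            // remove before mutating
            if (top.advance(target) != NO_MORE_DOCS) queue.add(top);
            if (queue.size() < minimumNrMatchers) return currentDoc = NO_MORE_DOCS;
        }
    }

    /** Takes the heap top as currentDoc and moves every matching scorer past it. */
    private boolean advanceAfterCurrent() {
        while (true) {
            currentDoc = queue.peek().docID();
            int nrMatchers = 0;
            while (!queue.isEmpty() && queue.peek().docID() == currentDoc) {
                ListScorer top = queue.poll();
                nrMatchers++;
                if (top.nextDoc() != NO_MORE_DOCS) queue.add(top); // exhausted ones drop out
            }
            if (nrMatchers >= minimumNrMatchers) return true;
            if (queue.size() < minimumNrMatchers) return false;
        }
    }

    public static void main(String[] args) {
        DisjunctionSketch s =
            new DisjunctionSketch(1, new ListScorer(1, 2, 6), new ListScorer(2, 4));
        System.out.println("advance(5) = " + s.advance(5) + ", heap size = " + s.queueSize());
        System.out.println("advance(6) = " + s.advance(6));
    }
}
```

With the two doc ID lists from the example, this prints "advance(5) = 6, heap size = 0" and then "advance(6) = 2147483647" (Integer.MAX_VALUE, used here as NO_MORE_DOCS): the size check runs before the target <= currentDoc check, so the second call never gets to return 6.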
On Tue, Apr 17, 2012 at 6:37 PM, Li Li <[email protected]> wrote:
> hi all,
> I am now hacking BooleanScorer2 so that the docID() of each leaf scorer
> (most likely a TermScorer) stays the same as that of the top-level Scorer.
> The reason I want this: when I collect a doc, I want to know which terms
> matched (especially for a BooleanClause whose Occur is SHOULD). We have
> discussed some solutions, such as adding bit masks in the disjunction
> scorers. With this method, when we find a matched doc, we can recursively
> find which leaf scorers matched. But we think it's not very efficient and
> not convenient to use (this was my proposal, but it was not accepted by the
> others on our team). Then we came up with another one: modifying
> DisjunctionSumScorer.
> We analysed the code and found that the only Scorer used by
> BooleanScorer2 that makes its child scorers' docID() differ from the
> parent's is an anonymous class derived from DisjunctionSumScorer. All the
> others, including SingleMatchScorer, countingConjunctionSumScorer
> (anonymous), dualConjunctionSumScorer, ReqOptSumScorer and ReqExclScorer,
> fit our needs.
> DisjunctionSumScorer uses a heap to find the smallest doc. After a
> matched doc is found, currentDoc is the matched doc, and the scorers in
> the heap that matched it have already called nextDoc(), so every sub
> scorer's current docID is beyond currentDoc. If DisjunctionSumScorers are
> nested N levels deep, a leaf scorer's current doc is the N-th next docID
> relative to the root of the scorer tree.
> So we modified DisjunctionSumScorer to make it behave as we expected.
> Then I wrote some test cases and it works well. I also generated some
> random TermScorers and compared the nextDoc(), score() and advance(int)
> methods of the original DisjunctionSumScorer and the modified one.
> nextDoc() and score() are exactly the same. But for advance(int target),
> we found some interesting and strange things.
> At the beginning, I thought that if target is less than the current
> docID, advance would do nothing and just return the current docID. This
> assumption made my algorithm go wrong. Then I read the code of TermScorer
> and found that each call to TermScorer.advance(int) calls nextDoc(),
> whether or not the current docID is already larger than target.
> So I was confused and then read the javadoc of DocIdSetIterator:
> -------- javadoc of DocIdSetIterator.advance(int target) --------
>
> int org.apache.lucene.search.DocIdSetIterator.advance(int target)
>     throws IOException
>
> Advances to the first beyond (see NOTE below) the current whose document
> number is greater than or equal
> to target. Returns the current document number or NO_MORE_DOCS if there
> are no more docs in the set.
> Behaves as if written:
> int advance(int target) {
>   int doc;
>   while ((doc = nextDoc()) < target) {
>   }
>   return doc;
> }
> Some implementations are considerably more efficient than that.
> NOTE: when target < current implementations may opt not to advance beyond
> their current docID().
> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
> Scorers. If your
> implementation cannot efficiently determine that it should exhaust, it is
> recommended that you check for
> that value in each call to this method.
> NOTE: after the iterator has exhausted you should not call this method, as
> it may result in unpredicted
> behavior.
> --------------------------------------
> Then I modified my algorithm again and found that
> DisjunctionSumScorer.advance(int target) has some strange behavior. In
> most cases it returns currentDoc if target <= currentDoc, but in some
> boundary conditions it does not.
> It's not a bug, but it made me sad: I thought my algorithm had a bug
> because its advance method did not behave exactly the same as the
> original DisjunctionSumScorer's.
> ---- code of DisjunctionSumScorer ----
> @Override
> public int advance(int target) throws IOException {
>   if (scorerDocQueue.size() < minimumNrMatchers) {
>     return currentDoc = NO_MORE_DOCS;
>   }
>   if (target <= currentDoc) {
>     return currentDoc;
>   }
>   ....
> -------------------
> In most cases, if (target <= currentDoc), it returns currentDoc;
> but if the previous advance exhausted the sub scorers, it may
> return NO_MORE_DOCS instead.
> An example:
> currentDoc=-1
> minimumNrMatchers=1
> subScorers:
> TermScorer: docIds: [1, 2, 6]
> TermScorer: docIds: [2, 4]
> after first call advance(5)
> currentDoc=6
> only first scorer is now in the heap, scorerDocQueue.size()==1
> then call advance(6)
> because scorerDocQueue.size() < minimumNrMatchers, it just returns
> NO_MORE_DOCS
>
> My question is: why is the advance(int target) method defined like this?
> Is it for efficiency, or for some other reason?
>
>
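
On the javadoc quoted above: as far as I can tell it permits two behaviors when the target is not beyond the current doc. Below is a small sketch of both in Java. EagerIterator follows the "behaves as if written" loop literally; LazyIterator returns early, using the same target <= current check as the quoted DisjunctionSumScorer code. Both class names are hypothetical, and the iterators walk a plain sorted int array rather than a real index.

```java
/** Sketch of two advance() behaviors; names are hypothetical, not Lucene classes. */
class AdvanceContractSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Uses the javadoc's reference loop: always calls nextDoc() at least once. */
    static class EagerIterator {
        private final int[] docs;
        private int pos = -1;
        EagerIterator(int... docs) { this.docs = docs; }
        int docID() {
            return pos < 0 ? -1 : pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
        int nextDoc() { pos++; return docID(); }
        int advance(int target) {
            int doc;
            while ((doc = nextDoc()) < target) { /* skip docs before target */ }
            return doc;
        }
    }

    /** Opts not to move when the target is not beyond the current doc. */
    static class LazyIterator extends EagerIterator {
        LazyIterator(int... docs) { super(docs); }
        @Override
        int advance(int target) {
            if (target <= docID()) return docID(); // stay on the current doc
            return super.advance(target);
        }
    }

    public static void main(String[] args) {
        EagerIterator eager = new EagerIterator(1, 2, 6);
        LazyIterator lazy = new LazyIterator(1, 2, 6);
        eager.advance(2); // both iterators are now on doc 2
        lazy.advance(2);
        System.out.println("eager.advance(1) = " + eager.advance(1));
        System.out.println("lazy.advance(1)  = " + lazy.advance(1));
    }
}
```

After both iterators are positioned on doc 2, eager.advance(1) moves on and returns 6, while lazy.advance(1) stays put and returns 2. A caller that needs identical results from two implementations should therefore only pass targets greater than the current docID.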