Hi Steven,

This collector looks correct to me. Resetting the counter to 0 on the first segment is indeed not necessary.
We have plenty of collectors that are very similar to this one, and we have
never observed any double-counting issue. I would suspect an issue in the
code that calls this collector. Maybe try printing the stack trace under
the `if (context.docBase == 0) {` check to see why your collector is being
called twice?

On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <stevenschlans...@gmail.com> wrote:

> Hi Lucene users,
>
> I am developing a search application that needs to do some basic
> summary statistics. We use Lucene 8.9.0.
> To improve performance for e.g. summing a value across 10,000
> documents, we are using DocValues as columnar storage.
>
> In order to retrieve the DocValues without collecting all hits into a
> TopDocs, which we determined to cause a lot of memory pressure and
> consume much time, we are using the expert Collector query interface.
>
> Here's the code, simplified a bit for the list:
>
> final var collector = new Collector() {
>     long sum = 0;
>
>     @Override
>     public ScoreMode scoreMode() {
>         return ScoreMode.COMPLETE_NO_SCORES;
>     }
>
>     @Override
>     public LeafCollector getLeafCollector(final LeafReaderContext context) throws IOException {
>         if (context.docBase == 0) {
>             sum = 0; // XXX: this should not be necessary?
>         }
>         final var subtotalValue = context.reader().getNumericDocValues("subtotal");
>         return new LeafCollector() {
>             @Override
>             public void setScorer(final Scorable scorer) throws IOException {
>             }
>
>             @Override
>             public void collect(final int doc) throws IOException {
>                 if (subtotalValue.docID() > doc
>                         || !subtotalValue.advanceExact(doc)
>                         || subtotalValue.longValue() == 0) {
>                     return;
>                 }
>                 sum += subtotalValue.longValue();
>             }
>         };
>     }
> };
> searcher.search(myQuery, collector);
> return collector.sum;
>
> The query is a moderately complicated Boolean query with some
> TermQuery and MultiTermQuery instances combined together.
> While first testing, I observed that seemingly the collector is called
> twice for each document, and the sum is exactly double what you would
> expect.
>
> It seems that the Collector is observing every matched document twice,
> and by printing out the Scorer, I see that it's done with two
> different BooleanScorer instances.
> You can see my hack that resets the collector every time it starts at
> docBase 0, which I am sure is not the right approach, but seems to
> work.
> What is the right pattern to ensure my Collector only observes result
> documents once, no matter the input query? I see a note in the
> documentation that state is supposed to be stored on the Scorer
> implementation, but I am not providing a custom Scorer, nor do I
> actually want any scoring at all.
>
> Thank you for any guidance!
> Steven

-- 
Adrien
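
For what it's worth, the stack-trace check suggested above could look roughly like the sketch below. It deliberately uses no Lucene types so it can be run on its own: `TracingCollector`, `SearchDriver`, `searchOnce` and `searchAgain` are hypothetical names made up for illustration, and `getLeafCollector(int docBase)` only stands in for the real `getLeafCollector(LeafReaderContext)`. In the actual collector, the only part you would need is the `new Throwable(...).printStackTrace()` line inside the `docBase == 0` branch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the collector; no Lucene types are used here. */
class TracingCollector {
    /** Names of the methods that asked for a leaf collector with docBase 0. */
    final List<String> firstSegmentCallers = new ArrayList<>();

    /** Mimics getLeafCollector(context), taking only the segment's docBase. */
    void getLeafCollector(int docBase) {
        if (docBase == 0) {
            Throwable trace = new Throwable("getLeafCollector called with docBase == 0");
            trace.printStackTrace(); // the full trace goes to stderr
            StackTraceElement[] frames = trace.getStackTrace();
            // frames[0] is this method; frames[1] is its direct caller, if present.
            if (frames.length > 1) {
                firstSegmentCallers.add(frames[1].getMethodName());
            }
        }
    }
}

/** Hypothetical driver that reaches the same collector from two code paths. */
class SearchDriver {
    static void searchOnce(TracingCollector collector) {
        collector.getLeafCollector(0);  // first segment
        collector.getLeafCollector(10); // later segment, not traced
    }

    static void searchAgain(TracingCollector collector) {
        collector.getLeafCollector(0);  // a second pass over the first segment
    }
}
```

If two traces show up for a single logical search, comparing the frames just below `getLeafCollector` in each should show which code path runs the collector a second time.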