Separate issue, but this collector is not going to work with concurrent search, since the sum is not updated in a thread-safe manner. Maybe you don't care, since you don't use a thread pool to execute your queries, but you probably should!
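
To make the concurrency point concrete, here is a minimal stdlib-only sketch of the usual fix: give each worker its own private sum and add the partial sums together once all workers are done. In Lucene this is the shape of the CollectorManager interface (newCollector() per slice, then reduce()), passed to IndexSearcher.search(Query, CollectorManager) on a searcher constructed with an executor. The names below (PerSliceSumSketch, PartialSum, sumConcurrently) are hypothetical, and plain long arrays stand in for index segments; this is an illustration of the pattern, not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerSliceSumSketch {

    /** Hypothetical stand-in for one collector instance: it owns a private, unshared sum. */
    static final class PartialSum {
        long sum = 0;

        void collect(long value) {
            sum += value; // safe without locking: only one task ever touches this instance
        }
    }

    /** Sums every value in every "segment", processing segments on a thread pool. */
    public static long sumConcurrently(List<long[]> segments, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<PartialSum>> futures = new ArrayList<>();
            for (long[] segment : segments) {
                // Analogous to newCollector(): a fresh accumulator per unit of work.
                futures.add(pool.submit(() -> {
                    PartialSum partial = new PartialSum();
                    for (long value : segment) {
                        partial.collect(value);
                    }
                    return partial;
                }));
            }
            // Analogous to reduce(): merge the partial results on the calling thread.
            long total = 0;
            for (Future<PartialSum> future : futures) {
                total += future.get().sum;
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<long[]> segments =
                List.of(new long[] {1, 2, 3}, new long[] {10, 20}, new long[] {100});
        System.out.println(sumConcurrently(segments, 2)); // prints 136
    }
}
```

The point of the design is that no mutable state is shared between threads while collecting, so no synchronization or atomics are needed on the hot path; the only cross-thread handoff is the Future result.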

On Wed, Sep 22, 2021, 8:38 AM Adrien Grand <jpou...@gmail.com> wrote:

> Hi Steven,
>
> This collector looks correct to me. Resetting the counter to 0 on the first
> segment is indeed not necessary.
>
> We have plenty of collectors that are very similar to this one and we never
> observed any double-counting issue. I would suspect an issue in the code
> that calls this collector. Maybe try to print the stack trace under the
> `if (context.docBase == 0) {` check to see why your collector is being
> called twice?
>
> On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
> stevenschlans...@gmail.com> wrote:
>
> > Hi Lucene users,
> >
> > I am developing a search application that needs to do some basic
> > summary statistics. We use Lucene 8.9.0.
> > To improve performance for e.g. summing a value across 10,000
> > documents, we are using DocValues as columnar storage.
> >
> > In order to retrieve the DocValues without collecting all hits into a
> > TopDocs, which we determined to cause a lot of memory pressure and
> > consume much time, we are using the expert Collector query interface.
> >
> > Here's the code, simplified a bit for the list:
> >
> > final collector = new Collector() {
> >     long sum = 0;
> >
> >     @Override
> >     public ScoreMode scoreMode() {
> >         return ScoreMode.COMPLETE_NO_SCORES;
> >     }
> >
> >     @Override
> >     public LeafCollector getLeafCollector(final LeafReaderContext context) throws IOException {
> >         if (context.docBase == 0) {
> >             sum = 0; // XXX: this should not be necessary?
> >         }
> >         final var subtotalValue = context.reader().getNumericDocValues("subtotal");
> >         return new LeafCollector() {
> >             @Override
> >             public void setScorer(final Scorable scorer) throws IOException {
> >             }
> >
> >             @Override
> >             public void collect(final int doc) throws IOException {
> >                 if (subtotalValue.docID() > doc || !subtotalValue.advanceExact(doc) || subtotalValue.longValue() == 0) {
> >                     return;
> >                 }
> >                 sum += subtotalValue.longValue();
> >             }
> >         };
> >     }
> > }
> > searcher.search(myQuery, collector);
> > return collector.sum;
> >
> > The query is a moderately complicated Boolean query with some
> > TermQuery and MultiTermQuery instances combined together.
> > While first testing, I observed that seemingly the collector is called
> > twice for each document, and the sum is exactly double what you would
> > expect.
> >
> > It seems that the Collector is observing every matched document twice,
> > and by printing out the Scorer, I see that it's done with two
> > different BooleanScorer instances.
> > You can see my hack that resets the collector every time it starts at
> > docBase 0. which I am sure is not the right approach, but seems to
> > work.
> > What is the right pattern to ensure my Collector only observes result
> > documents once, no matter the input query? I see a note in the
> > documentation that state is supposed to be stored on the Scorer
> > implementation, but I am not providing a custom Scorer, nor do I
> > actually want any scoring at all.
> >
> > Thank you for any guidance!
> > Steven
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
> --
> Adrien
>