Querying into a Collector visits documents multiple times

Steven Schlansker Tue, 21 Sep 2021 12:30:51 -0700

Hi Lucene users,

I am developing a search application that needs to do some basic
summary statistics. We use Lucene 8.9.0.
To improve performance for e.g. summing a value across 10,000
documents, we are using DocValues as columnar storage.


In order to retrieve the DocValues without collecting all hits into a
TopDocs, which we determined to cause a lot of memory pressure and
consume much time, we are using the expert Collector query interface.

Here's the code, simplified a bit for the list:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;

final var collector = new Collector() {
    long sum = 0;

    @Override
    public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES;
    }

    @Override
    public LeafCollector getLeafCollector(final LeafReaderContext context) throws IOException {
        if (context.docBase == 0) {
            sum = 0; // XXX: this should not be necessary?
        }
        // Note: getNumericDocValues returns null if the segment has no values for the field.
        final var subtotalValue = context.reader().getNumericDocValues("subtotal");
        return new LeafCollector() {
            @Override
            public void setScorer(final Scorable scorer) throws IOException {
                // Scores are not needed.
            }

            @Override
            public void collect(final int doc) throws IOException {
                if (subtotalValue.docID() > doc
                        || !subtotalValue.advanceExact(doc)
                        || subtotalValue.longValue() == 0) {
                    return;
                }
                sum += subtotalValue.longValue();
            }
        };
    }
};
searcher.search(myQuery, collector);
return collector.sum;

The query is a moderately complicated Boolean query with some
TermQuery and MultiTermQuery instances combined together.
During initial testing, I observed that the collector seems to be
called twice for each document, and the sum is exactly double what I
would expect.

It seems that the Collector is observing every matched document twice,
and by printing out the Scorer, I see that it's done with two
different BooleanScorer instances.
You can see my hack that resets the sum every time a leaf with
docBase 0 starts, which I am sure is not the right approach, but it
seems to work.
What is the right pattern to ensure my Collector only observes result
documents once, no matter the input query? I see a note in the
documentation that state is supposed to be stored on the Scorer
implementation, but I am not providing a custom Scorer, nor do I
actually want any scoring at all.
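
To check whether documents really are collected twice within a single
search call, a small diagnostic helper along these lines might help. It
is only a sketch: VisitTracker and its methods are hypothetical names,
not part of Lucene. The one Lucene fact it relies on is that
context.docBase plus the segment-local doc ID passed to collect() gives
the index-wide document ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical diagnostic helper (not a Lucene class): counts collect() calls per global doc ID. */
final class VisitTracker {
    private final Map<Integer, Integer> visits = new HashMap<>();

    /**
     * Records one visit. docBase is the segment offset (LeafReaderContext.docBase);
     * doc is the segment-local ID that was passed to LeafCollector.collect().
     */
    void record(int docBase, int doc) {
        visits.merge(docBase + doc, 1, Integer::sum);
    }

    /** Returns the global doc IDs seen more than once, mapped to their visit counts. */
    Map<Integer, Integer> repeated() {
        Map<Integer, Integer> out = new TreeMap<>();
        visits.forEach((id, count) -> {
            if (count > 1) {
                out.put(id, count);
            }
        });
        return out;
    }

    /** Returns the number of distinct global doc IDs recorded so far. */
    int distinctDocs() {
        return visits.size();
    }
}
```

The intended use would be to create one tracker immediately before each
searcher.search(...) call, call tracker.record(context.docBase, doc)
from collect(), and inspect tracker.repeated() afterwards. If that map
is empty for every individual call while the total is still doubled,
the extra visits may come from outside a single pass, for example from
one collector instance being passed to more than one search call.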

Thank you for any guidance!
Steven

