Re: Querying into a Collector visits documents multiple times

Adrien Grand Wed, 22 Sep 2021 05:31:36 -0700

Hi Steven,

This collector looks correct to me. Resetting the counter to 0 on the first
segment is indeed not necessary.


We have plenty of collectors that are very similar to this one and we never
observed any double-counting issue. I would suspect an issue in the code
that calls this collector. Maybe try to print the stack trace under the
`if (context.docBase == 0) {` check to see why your collector is being
called twice?
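
A minimal sketch of that debugging step, with the Lucene types left out so
that it runs on its own. `SegmentCallTracer` and `onLeafCollector` are
hypothetical names for illustration, not part of Lucene; the idea is to call
the helper from `getLeafCollector` with `context.docBase` and compare the
traces it prints.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical debugging helper; not part of Lucene. */
final class SegmentCallTracer {
    // Number of times a leaf collector was requested for the segment at docBase 0.
    private int firstSegmentCalls = 0;

    /**
     * Intended to be called from getLeafCollector(context) with context.docBase.
     * Prints the current stack trace to stderr and returns it for the first
     * segment (docBase 0); returns null for every other segment.
     */
    String onLeafCollector(int docBase) {
        if (docBase != 0) {
            return null;
        }
        firstSegmentCalls++;
        StringWriter buffer = new StringWriter();
        new Throwable("getLeafCollector call #" + firstSegmentCalls + " for docBase 0")
                .printStackTrace(new PrintWriter(buffer, true));
        String trace = buffer.toString();
        System.err.print(trace);
        return trace;
    }

    int firstSegmentCalls() {
        return firstSegmentCalls;
    }
}
```

If two traces are printed for a single search, the frames below
`getLeafCollector` should show which caller starts the second pass.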

On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
stevenschlans...@gmail.com> wrote:

> Hi Lucene users,
>
> I am developing a search application that needs to do some basic
> summary statistics. We use Lucene 8.9.0.
> To improve performance for e.g. summing a value across 10,000
> documents, we are using DocValues as columnar storage.
>
> In order to retrieve the DocValues without collecting all hits into a
> TopDocs, which we determined to cause a lot of memory pressure and
> consume much time, we are using the expert Collector query interface.
>
> Here's the code, simplified a bit for the list:
>
> final var collector = new Collector() {
>     long sum = 0;
>
>     @Override
>     public ScoreMode scoreMode() {
>         return ScoreMode.COMPLETE_NO_SCORES;
>     }
>
>     @Override
>     public LeafCollector getLeafCollector(final LeafReaderContext context) throws IOException {
>         if (context.docBase == 0) {
>             sum = 0; // XXX: this should not be necessary?
>         }
>         final var subtotalValue = context.reader().getNumericDocValues("subtotal");
>         return new LeafCollector() {
>             @Override
>             public void setScorer(final Scorable scorer) throws IOException {
>             }
>
>             @Override
>             public void collect(final int doc) throws IOException {
>                 if (subtotalValue.docID() > doc || !subtotalValue.advanceExact(doc) || subtotalValue.longValue() == 0) {
>                     return;
>                 }
>                 sum += subtotalValue.longValue();
>             }
>         };
>     }
> };
> searcher.search(myQuery, collector);
> return collector.sum;
>
> The query is a moderately complicated Boolean query with some
> TermQuery and MultiTermQuery instances combined together.
> While first testing, I observed that seemingly the collector is called
> twice for each document, and the sum is exactly double what you would
> expect.
>
> It seems that the Collector is observing every matched document twice,
> and by printing out the Scorer, I see that it's done with two
> different BooleanScorer instances.
> You can see my hack that resets the collector every time it starts at
> docBase 0, which I am sure is not the right approach, but seems to
> work.
> What is the right pattern to ensure my Collector only observes result
> documents once, no matter the input query? I see a note in the
> documentation that state is supposed to be stored on the Scorer
> implementation, but I am not providing a custom Scorer, nor do I
> actually want any scoring at all.
>
> Thank you for any guidance!
> Steven

-- 
Adrien
