Re: Querying into a Collector visits documents multiple times

Michael Sokolov Fri, 24 Sep 2021 03:51:47 -0700

Separate issue, but this collector will not work with concurrent search,
since the sum is not updated in a thread-safe manner. Maybe you don't
care, since you don't use a thread pool to execute your queries, but you
probably should!
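
To sketch what a thread-safe version could look like: rather than sharing
one mutable `sum`, give each slice of the index its own private subtotal
and merge the subtotals once all slices are done. In Lucene this shape is
what CollectorManager provides (newCollector() per slice, reduce() to
merge), passed to IndexSearcher.search(Query, CollectorManager). The
example below is plain Java with no Lucene dependency, only a model of the
pattern; `PartialSums`, `sumSlice`, and `concurrentSum` are hypothetical
names, and each `long[]` stands in for the values matched in one slice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartialSums {

    /** Sums one slice. The accumulator is local to the task, so no locking is needed. */
    static long sumSlice(long[] values) {
        long local = 0;
        for (long v : values) {
            local += v;
        }
        return local;
    }

    /** Sums every slice on a thread pool, then merges the subtotals on the calling thread. */
    static long concurrentSum(List<long[]> slices, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (long[] slice : slices) {
                Callable<Long> task = () -> sumSlice(slice);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // merge step, analogous to reduce()
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("slice task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<long[]> slices =
            List.of(new long[] {1, 2, 3}, new long[] {10, 20}, new long[] {});
        System.out.println(concurrentSum(slices, 2)); // prints 36
    }
}
```

No subtotal is ever written by more than one thread, so the hot loop needs
neither atomics nor locks; the only cross-thread hand-off is Future.get().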


On Wed, Sep 22, 2021, 8:38 AM Adrien Grand <[email protected]> wrote:

> Hi Steven,
>
> This collector looks correct to me. Resetting the counter to 0 on the first
> segment is indeed not necessary.
>
> We have plenty of collectors that are very similar to this one and we never
> observed any double-counting issue. I would suspect an issue in the code
> that calls this collector. Maybe try to print the stack trace under the
> `if (context.docBase == 0) {` check to see why your collector is being
> called twice?
>
> On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
> [email protected]> wrote:
>
> > Hi Lucene users,
> >
> > I am developing a search application that needs to do some basic
> > summary statistics. We use Lucene 8.9.0.
> > To improve performance when, e.g., summing a value across 10,000
> > documents, we are using DocValues as columnar storage.
> >
> > In order to retrieve the DocValues without collecting all hits into a
> > TopDocs, which we determined to cause a lot of memory pressure and
> > consume much time, we are using the expert Collector query interface.
> >
> > Here's the code, simplified a bit for the list:
> >
> > final var collector = new Collector() {
> >     long sum = 0;
> >
> >     @Override
> >     public ScoreMode scoreMode() {
> >         return ScoreMode.COMPLETE_NO_SCORES;
> >     }
> >
> >     @Override
> >     public LeafCollector getLeafCollector(final LeafReaderContext context) throws IOException {
> >         if (context.docBase == 0) {
> >             sum = 0; // XXX: this should not be necessary?
> >         }
> >         final var subtotalValue = context.reader().getNumericDocValues("subtotal");
> >         return new LeafCollector() {
> >             @Override
> >             public void setScorer(final Scorable scorer) throws IOException {
> >             }
> >
> >             @Override
> >             public void collect(final int doc) throws IOException {
> >                 if (subtotalValue.docID() > doc
> >                         || !subtotalValue.advanceExact(doc)
> >                         || subtotalValue.longValue() == 0) {
> >                     return;
> >                 }
> >                 sum += subtotalValue.longValue();
> >             }
> >         };
> >     }
> > };
> > searcher.search(myQuery, collector);
> > return collector.sum;
> >
> > The query is a moderately complicated Boolean query with some
> > TermQuery and MultiTermQuery instances combined together.
> > During initial testing, I observed that the collector seems to be
> > called twice for each document, and the sum is exactly double what you
> > would expect.
> >
> > It seems that the Collector is observing every matched document twice,
> > and by printing out the Scorer, I see that it's done with two
> > different BooleanScorer instances.
> > You can see my hack that resets the collector every time it starts at
> > docBase 0, which I am sure is not the right approach, but seems to
> > work.
> > What is the right pattern to ensure my Collector only observes result
> > documents once, no matter the input query? I see a note in the
> > documentation that state is supposed to be stored on the Scorer
> > implementation, but I am not providing a custom Scorer, nor do I
> > actually want any scoring at all.
> >
> > Thank you for any guidance!
> > Steven
> >
>
> --
> Adrien
>
