Re: Querying into a Collector visits documents multiple times

Michael Sokolov Fri, 24 Sep 2021 03:52:44 -0700

Ah, sorry, never mind. I confused Collector with CollectorManager.

On Fri, Sep 24, 2021, 6:51 AM Michael Sokolov <msoko...@gmail.com> wrote:


> Separate issue, but this collector is not going to work with concurrent
> search since the sum is not updated in a thread-safe manner. Maybe you
> don't care, since you don't use a thread pool to execute your queries, but
> you probably should!
>
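To make the thread-safety point concrete: as the follow-up at the top notes, it applies to a CollectorManager run across several threads, not to a single Collector passed to search(). The sketch below is plain Java with no Lucene dependency. SliceSumSketch, sumSlice and sumAll are hypothetical names, and each long[] merely stands in for the "subtotal" values matched in one slice of the index. It illustrates the general shape: each worker keeps its own local sum, and the partial sums are combined once at the end, in the spirit of CollectorManager's newCollector() and reduce().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical, Lucene-free illustration of "one accumulator per slice, then reduce".
final class SliceSumSketch {

    // Plays the role of one per-slice collector: the running sum is a local
    // variable, so no other thread can observe or modify it while collecting.
    static long sumSlice(long[] subtotals) {
        long local = 0;
        for (long value : subtotals) {
            local += value;
        }
        return local;
    }

    // Plays the role of the reduce step: partial sums are combined on the
    // calling thread only after each worker has finished.
    static long sumAll(List<long[]> slices, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (long[] slice : slices) {
                Callable<Long> task = () -> sumSlice(slice);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("slice summation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape there is no shared mutable `sum` field, so no synchronization is needed. An AtomicLong or LongAdder shared by all workers would be the other obvious option.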
> On Wed, Sep 22, 2021, 8:38 AM Adrien Grand <jpou...@gmail.com> wrote:
>
>> Hi Steven,
>>
>> This collector looks correct to me. Resetting the counter to 0 on the
>> first segment is indeed not necessary.
>>
>> We have plenty of collectors that are very similar to this one and we
>> never observed any double-counting issue. I would suspect an issue in the
>> code that calls this collector. Maybe try to print the stack trace under
>> the `if (context.docBase == 0) {` check to see why your collector is being
>> called twice?
>>
>> On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
>> stevenschlans...@gmail.com> wrote:
>>
>> > Hi Lucene users,
>> >
>> > I am developing a search application that needs to do some basic
>> > summary statistics. We use Lucene 8.9.0.
>> > To improve performance for e.g. summing a value across 10,000
>> > documents, we are using DocValues as columnar storage.
>> >
>> > In order to retrieve the DocValues without collecting all hits into a
>> > TopDocs, which we determined to cause a lot of memory pressure and
>> > consume much time, we are using the expert Collector query interface.
>> >
>> > Here's the code, simplified a bit for the list:
>> >
>> > final var collector = new Collector() {
>> >     long sum = 0;
>> >
>> >     @Override
>> >     public ScoreMode scoreMode() {
>> >         return ScoreMode.COMPLETE_NO_SCORES;
>> >     }
>> >
>> >     @Override
>> >     public LeafCollector getLeafCollector(final LeafReaderContext context)
>> >             throws IOException {
>> >         if (context.docBase == 0) {
>> >             sum = 0; // XXX: this should not be necessary?
>> >         }
>> >         final var subtotalValue =
>> >                 context.reader().getNumericDocValues("subtotal");
>> >         return new LeafCollector() {
>> >             @Override
>> >             public void setScorer(final Scorable scorer) throws IOException {
>> >             }
>> >
>> >             @Override
>> >             public void collect(final int doc) throws IOException {
>> >                 if (subtotalValue.docID() > doc
>> >                         || !subtotalValue.advanceExact(doc)
>> >                         || subtotalValue.longValue() == 0) {
>> >                     return;
>> >                 }
>> >                 sum += subtotalValue.longValue();
>> >             }
>> >         };
>> >     }
>> > };
>> > searcher.search(myQuery, collector);
>> > return collector.sum;
>> >
>> > The query is a moderately complicated Boolean query with some
>> > TermQuery and MultiTermQuery instances combined together.
>> > While first testing, I observed that seemingly the collector is called
>> > twice for each document, and the sum is exactly double what you would
>> > expect.
>> >
>> > It seems that the Collector is observing every matched document twice,
>> > and by printing out the Scorer, I see that it's done with two
>> > different BooleanScorer instances.
>> > You can see my hack that resets the collector every time it starts at
>> > docBase 0, which I am sure is not the right approach, but seems to
>> > work.
>> > What is the right pattern to ensure my Collector only observes result
>> > documents once, no matter the input query? I see a note in the
>> > documentation that state is supposed to be stored on the Scorer
>> > implementation, but I am not providing a custom Scorer, nor do I
>> > actually want any scoring at all.
>> >
>> > Thank you for any guidance!
>> > Steven
>> >
>>
>> --
>> Adrien
>>
>
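On the suggestion to print a stack trace under the docBase check: a stack trace can be captured without throwing anything, which makes it easy to compare the two call paths side by side. This is a minimal sketch in plain Java. CallSiteTrace, capture and onFirstSegment are hypothetical names, and onFirstSegment only stands in for the body of getLeafCollector.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper for finding out who is calling a method.
final class CallSiteTrace {

    // Returns the current call stack as text. The Throwable is only
    // constructed, never thrown, so control flow is unaffected.
    static String capture(String label) {
        StringWriter out = new StringWriter();
        new Throwable(label).printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    // Stand-in for getLeafCollector: record the caller only when the first
    // segment (docBase 0) is requested, i.e. once per pass over the index.
    static String onFirstSegment(int docBase) {
        if (docBase == 0) {
            return capture("getLeafCollector called with docBase 0");
        }
        return "";
    }
}
```

If the collector really is driven twice, logging the returned text inside the `if (context.docBase == 0)` branch should produce two traces, and the frames where they differ point at the calling code responsible.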
