Ah, sorry, never mind. I confused Collector with CollectorManager.

On Fri, Sep 24, 2021, 6:51 AM Michael Sokolov <msoko...@gmail.com> wrote:
> Separate issue, but this collector is not going to work with concurrent
> search since the sum is not updated in a thread-safe manner. Maybe you
> don't care, since you don't use a thread pool to execute your queries, but
> you probably should!
>
> On Wed, Sep 22, 2021, 8:38 AM Adrien Grand <jpou...@gmail.com> wrote:
>
>> Hi Steven,
>>
>> This collector looks correct to me. Resetting the counter to 0 on the
>> first segment is indeed not necessary.
>>
>> We have plenty of collectors that are very similar to this one and we
>> never observed any double-counting issue. I would suspect an issue in the
>> code that calls this collector. Maybe try to print the stack trace under
>> the `if (context.docBase == 0) {` check to see why your collector is being
>> called twice?
>>
>> On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
>> stevenschlans...@gmail.com> wrote:
>>
>> > Hi Lucene users,
>> >
>> > I am developing a search application that needs to do some basic
>> > summary statistics. We use Lucene 8.9.0.
>> > To improve performance for e.g. summing a value across 10,000
>> > documents, we are using DocValues as columnar storage.
>> >
>> > In order to retrieve the DocValues without collecting all hits into a
>> > TopDocs, which we determined to cause a lot of memory pressure and
>> > consume much time, we are using the expert Collector query interface.
>> >
>> > Here's the code, simplified a bit for the list:
>> >
>> > final var collector = new Collector() {
>> >   long sum = 0;
>> >
>> >   @Override
>> >   public ScoreMode scoreMode() {
>> >     return ScoreMode.COMPLETE_NO_SCORES;
>> >   }
>> >
>> >   @Override
>> >   public LeafCollector getLeafCollector(final LeafReaderContext context)
>> >       throws IOException {
>> >     if (context.docBase == 0) {
>> >       sum = 0; // XXX: this should not be necessary?
>> >     }
>> >     final var subtotalValue =
>> >         context.reader().getNumericDocValues("subtotal");
>> >     return new LeafCollector() {
>> >       @Override
>> >       public void setScorer(final Scorable scorer) throws IOException {
>> >       }
>> >
>> >       @Override
>> >       public void collect(final int doc) throws IOException {
>> >         if (subtotalValue.docID() > doc
>> >             || !subtotalValue.advanceExact(doc)
>> >             || subtotalValue.longValue() == 0) {
>> >           return;
>> >         }
>> >         sum += subtotalValue.longValue();
>> >       }
>> >     };
>> >   }
>> > };
>> > searcher.search(myQuery, collector);
>> > return collector.sum;
>> >
>> > The query is a moderately complicated Boolean query with some
>> > TermQuery and MultiTermQuery instances combined together.
>> > While first testing, I observed that seemingly the collector is called
>> > twice for each document, and the sum is exactly double what you would
>> > expect.
>> >
>> > It seems that the Collector is observing every matched document twice,
>> > and by printing out the Scorer, I see that it's done with two
>> > different BooleanScorer instances.
>> > You can see my hack that resets the collector every time it starts at
>> > docBase 0, which I am sure is not the right approach, but seems to
>> > work.
>> > What is the right pattern to ensure my Collector only observes result
>> > documents once, no matter the input query? I see a note in the
>> > documentation that state is supposed to be stored on the Scorer
>> > implementation, but I am not providing a custom Scorer, nor do I
>> > actually want any scoring at all.
>> >
>> > Thank you for any guidance!
>> > Steven
>>
>> --
>> Adrien
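
The thread-safety concern raised in this thread (one shared `sum` field updated from several search threads) is usually avoided by giving each thread its own accumulator and merging the partial results once at the end. That is roughly the shape of Lucene's `CollectorManager`, which creates one collector per slice via `newCollector()` and combines them in `reduce()`. Below is a minimal stand-alone sketch of that pattern in plain Java. It deliberately does not use any Lucene classes, and `PerSliceSumSketch`, `PartialSum` and `sumConcurrently` are hypothetical names chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-alone sketch: plain Java, no Lucene classes. All names are hypothetical.
final class PerSliceSumSketch {

    // Plays the role of one collector instance. It is only ever touched by the
    // single task that owns it, so a plain long field is safe here.
    static final class PartialSum {
        long sum = 0;

        void collect(long value) {
            sum += value;
        }
    }

    // Sums each slice on its own task with its own PartialSum, then merges the
    // partial results on the calling thread (the "reduce" step).
    static long sumConcurrently(List<long[]> slices, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<PartialSum>> futures = new ArrayList<>();
            for (long[] slice : slices) {
                futures.add(pool.submit(() -> {
                    PartialSum partial = new PartialSum(); // one accumulator per slice
                    for (long value : slice) {
                        partial.collect(value);
                    }
                    return partial;
                }));
            }
            long total = 0;
            for (Future<PartialSum> future : futures) {
                // get() waits for the task and makes its writes visible here.
                total += future.get().sum;
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while summing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a slice task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, calling `PerSliceSumSketch.sumConcurrently` with the three slices `{1, 2, 3}`, `{10, 20}` and `{}` on two threads returns 36. No field is shared between tasks, so no locking or atomic type is needed.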