Sorry for the delay, I opened a PR for conjunctions at https://github.com/apache/lucene/pull/13904.

On Sat, Oct 5, 2024 at 12:44 AM Rui Wu <rui...@mongodb.com> wrote:

> Hi Adrien,
>
> We find that on Lucene 9.11:
>
> 1. A MUST of two ConstantScoreQuery clauses:
> (+ConstantScore($meta/fieldNames:searchableField)
> +ConstantScore($meta/fieldNames:_bus_key)) invokes score() 3.6M times.
> The query latency is quite high, and the Max Conjunction scorer shows
> up in the flamegraph
> <https://htmlpreview.github.io/?https://github.com/wurui90/scratch/blob/main/flamegraphs/exists-with-limit-200-lucene911-must.html>.
>
> while
>
> 2. A FILTER of two ConstantScoreQuery clauses:
> #ConstantScore($meta/fieldNames:searchableField)
> #ConstantScore($meta/fieldNames:_bus_key) invokes score() 1001 times.
>
> In my mental model, the result of query 1 is identical to that of query 2,
> so 1 can be optimized to 2. I wonder why Lucene doesn't do this
> optimization internally?
>
> On Lucene 9.7, both queries invoke score() 1001 times.
>
> Thanks!
>
> On Fri, Sep 20, 2024 at 11:53 AM Rui Wu <rui...@mongodb.com> wrote:
>
>> Hi Adrien,
>>
>> You are right, the Max Conjunction scorer shows up in the flamegraph for
>> 12 MUST clauses:
>> https://htmlpreview.github.io/?https://github.com/wurui90/scratch/blob/main/flamegraphs/exists-with-limit-200-lucene911-must.html
>>
>> Thanks!
>>
>> On Fri, Sep 20, 2024 at 2:27 AM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> This suggests that BlockMaxConjunctionBulkScorer has a similar issue,
>>> I'll look into it too.
>>>
>>> On Thu, Sep 19, 2024 at 2:48 AM Rui Wu <rui...@mongodb.com> wrote:
>>>
>>>> Hi Adrien,
>>>>
>>>> Thanks for your help and for putting up a fix!
>>>>
>>>> Another experiment I did without your PR: if the 12 SHOULD clauses
>>>> are changed to 12 MUST clauses, the problem is the same: it collects
>>>> 3.6M docs on Lucene 9.11 but 1001 docs on Lucene 9.7. Does this data
>>>> point align with how MaxScoreBulkScorer works?
>>>>
>>>> Thanks!
>>>>
>>>> On Wed, Sep 18, 2024 at 1:51 AM Adrien Grand <jpou...@gmail.com> wrote:
>>>>
>>>>> Thank you, this last comment was helpful and helped me understand the
>>>>> problem. I opened a PR at https://github.com/apache/lucene/pull/13800.
>>>>>
>>>>> On Tue, Sep 17, 2024 at 7:45 PM Rui Wu <rui...@mongodb.com> wrote:
>>>>>
>>>>>> One more data point: in Lucene 9.7, this query (12 SHOULD clauses)
>>>>>> collected 1001 results, while in Lucene 9.11, this query (12 SHOULD
>>>>>> clauses) collected all docs (3.6M collect count).
>>>>>>
>>>>>> In Lucene 9.11, if the query has only one SHOULD clause, it collects
>>>>>> 1001 results. If the query has multiple clauses, it collects 3.6M
>>>>>> results.
>>>>>>
>>>>>> On Tue, Sep 17, 2024 at 9:09 AM Rui Wu <rui...@mongodb.com> wrote:
>>>>>>
>>>>>>> The latency of this query increased from 14.65 ms to 20.90 ms.
>>>>>>>
>>>>>>> We use `TopScoreDocCollector.createSharedManager(/*batchSize*/ 101,
>>>>>>> /*searchAfterFieldDoc*/ null, /*hitsThreshold*/ 1000);`
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>>
>>>>>>> On Tue, Sep 17, 2024 at 6:45 AM Adrien Grand <jpou...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can you tell us how long this query used to take, and how long it
>>>>>>>> takes now?
>>>>>>>> Also, are you using IndexSearcher's default total hit count
>>>>>>>> threshold of 1,000, or are you passing a custom value to
>>>>>>>> TopScoreDocCollectorManager?
>>>>>>>>
>>>>>>>> On Tue, Sep 17, 2024 at 10:14 AM Rui Wu <rui...@mongodb.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Adrien,
>>>>>>>>>
>>>>>>>>> Thanks for looking into this! Here are more screenshots of the
>>>>>>>>> flamegraph. The original flamegraph HTML files contain stack traces
>>>>>>>>> from our app, so I am not sharing them here.
>>>>>>>>> [image: Screenshot 2024-09-17 at 1.13.07 AM.png]
>>>>>>>>> [image: Screenshot 2024-09-17 at 1.12.01 AM.png]
>>>>>>>>>
>>>>>>>>> On Tue, Sep 17, 2024 at 1:00 AM Adrien Grand <jpou...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Rui,
>>>>>>>>>>
>>>>>>>>>> We actually released a change that should make MaxScoreBulkScorer
>>>>>>>>>> faster on dense disjunctions in 9.8:
>>>>>>>>>> https://github.com/apache/lucene/pull/12444. Your benchmark case
>>>>>>>>>> is quite specific though, as all clauses match all docs and produce
>>>>>>>>>> constant scores, so I would expect the scorer to quickly realize
>>>>>>>>>> that it can skip all documents once it's scored the first k docs.
>>>>>>>>>> This makes me wonder whether it is bottlenecked on skipping blocks
>>>>>>>>>> of documents rather than on scoring them.
>>>>>>>>>> Would you be able to share your whole flame graph? It looks like it
>>>>>>>>>> may be truncated at the top.
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 16, 2024 at 10:01 PM Rui Wu <rui...@mongodb.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Correction: The index has 3.6 million documents.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 16, 2024 at 1:00 PM Rui Wu <rui...@mongodb.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear experts,
>>>>>>>>>>>>
>>>>>>>>>>>> In our MongoDB Atlas Search performance regression test between
>>>>>>>>>>>> Lucene 9.7 and Lucene 9.11, we detected a 43% latency regression
>>>>>>>>>>>> in this query shape:
>>>>>>>>>>>> 12 SHOULD clauses, and each clause matches all of the documents.
>>>>>>>>>>>> Each SHOULD clause is wrapped in ConstantScoreQuery.
>>>>>>>>>>>>
>>>>>>>>>>>> The index has 3.6 documents, and every document is identical:
>>>>>>>>>>>> Every document is {"path": ["1", "2", "3" ... "12"]}
>>>>>>>>>>>> The query shape is a BooleanQuery of SHOULD "1", SHOULD "2",
>>>>>>>>>>>> ... SHOULD "12".
>>>>>>>>>>>>
>>>>>>>>>>>> Our flamegraphs show that most of the time in search() is spent
>>>>>>>>>>>> on the MaxScoreBulkScorer class:
>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>
>>>>>>>>>>>> We wonder if this extreme test case is expected to be slow on
>>>>>>>>>>>> MaxScoreBulkScorer?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>>
>>>>>>>>>>>> Rui Wu
>>>>>>>>>>>> Lead Engineer, MongoDB
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Adrien
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Adrien
>>>>>>>>
>>>>>
>>>>> --
>>>>> Adrien
>>>>>
>>>
>>> --
>>> Adrien
>>>

--
Adrien
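
The expectation running through this thread (with constant scores, collection should stop shortly after the hit threshold is crossed) can be illustrated with a toy model. This is only a minimal sketch in plain Java, not Lucene code: the class `ConstantScoreModel`, the method `collect`, and the `honorMinCompetitiveScore` flag are hypothetical names made up for illustration, and the loop merely approximates how a top-k collector raises its minimum competitive score. The arguments mirror the numbers quoted in the thread (3.6 million docs, 101 hits, a hit threshold of 1000).

```java
import java.util.PriorityQueue;

// Toy model (NOT Lucene code): every doc matches and gets the same score.
final class ConstantScoreModel {

    // Returns how many docs are collected before the loop stops.
    // honorMinCompetitiveScore is a hypothetical switch: when false, the
    // "scorer" ignores the collector's feedback and visits every doc.
    static long collect(int maxDoc, int numHits, int totalHitsThreshold,
                        float constantScore, boolean honorMinCompetitiveScore) {
        PriorityQueue<Float> topScores = new PriorityQueue<>(); // min-heap of the best scores
        float minCompetitiveScore = 0f;
        long collected = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            float score = constantScore;
            if (honorMinCompetitiveScore && score < minCompetitiveScore) {
                break; // all remaining docs share this score, so none can compete
            }
            collected++;
            if (topScores.size() < numHits) {
                topScores.add(score);
            } else if (score > topScores.peek()) {
                topScores.poll();
                topScores.add(score);
            }
            // Once more than totalHitsThreshold hits have been counted, only
            // scores strictly above the current k-th best remain competitive.
            if (collected > totalHitsThreshold && topScores.size() == numHits) {
                float bottom = topScores.peek();
                minCompetitiveScore = Math.nextUp(bottom);
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(collect(3_600_000, 101, 1000, 1.0f, true));  // prints 1001
        System.out.println(collect(3_600_000, 101, 1000, 1.0f, false)); // prints 3600000
    }
}
```

In this model, honoring the minimum competitive score stops collection at 1001 docs, and ignoring it visits all 3.6M docs, which matches the two collect counts reported in the thread. It says nothing about where the real scorers spend their time; that is what the linked PRs address.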