But actually Patrick Zhai added support for nondeterministic regexes that might help with cases like that? There is this in TestRegexpQuery:
/** Test worst-case for getCommonSuffix optimization */
public void testSlowCommonSuffix() throws Exception {
  expectThrows(
      TooComplexToDeterminizeException.class,
      () -> {
        new RegexpQuery(new Term("stringvalue", "(.*a){2000}"));
      });
}

On Tue, Aug 6, 2024 at 10:56 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> Yes, I think degenerate regexes like *a* are potentially costly.
> Actually something like *Ⱗ* is probably worse, since it would need
> to scan the entire FST (which probably has some a's in it?)
>
> I don't see any way around that aside from: (1) telling the user not to do
> that, or (2) putting some accounting on the FST so it can early-terminate.
>
> On Fri, Aug 2, 2024 at 8:17 PM Michael Froh <msf...@gmail.com> wrote:
> >
> > Incidentally, speaking as someone with only a superficial understanding of
> > how the FSTs work, I'm wondering if there is a risk of cost in expanding the
> > first few terms.
> >
> > Say we have a million terms, but only one contains an 'a'. If someone
> > searches for '*a*', does that devolve into a term scan? Or can the FST
> > efficiently identify an arc with an 'a' and efficiently identify terms
> > containing that arc?
> >
> > Thanks,
> > Froh
> >
> > On Fri, Aug 2, 2024 at 3:50 PM Michael Froh <msf...@gmail.com> wrote:
> >>
> >> Exactly!
> >>
> >> My initial implementation added some potential cost. (I think I enumerated
> >> up to 128 terms before giving up.) Now that Mayya has moved the (probably
> >> tiny) cost of expanding the first 16 terms upfront, my change is
> >> theoretically "free".
> >>
> >> Froh
> >>
> >> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller <gsmil...@gmail.com> wrote:
> >>>
> >>> Hey Froh-
> >>>
> >>> I got some time to look through your PR (most of the time was actually
> >>> spent refreshing my memory on the change history leading up to your PR
> >>> and digesting the issue described). I think this makes a ton of sense.
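Froh's '*a*' question above can be illustrated with a toy model (the class and method names here are hypothetical, not Lucene's FST or terms-enum API): a sorted term dictionary lets a prefix query seek directly to its block, while a leading-wildcard "contains" query offers no prefix to seek on, so in the worst case every term gets inspected.

```java
import java.util.*;

class TermScanCost {
  /** Prefix query (e.g. abc*): the sorted dictionary supports seeking
   * straight to the matching block, so only matching terms are visited. */
  static int countPrefix(List<String> sortedTerms, String prefix) {
    int lo = Collections.binarySearch(sortedTerms, prefix);
    int i = lo >= 0 ? lo : -lo - 1;  // first term >= prefix
    int n = 0;
    while (i < sortedTerms.size() && sortedTerms.get(i).startsWith(prefix)) {
      n++;
      i++;
    }
    return n;
  }

  /** "Contains" query (e.g. *a*): there is no usable prefix, so this toy
   * version inspects every term, mirroring the degenerate-scan concern. */
  static int countContains(List<String> sortedTerms, char c) {
    int n = 0;
    for (String t : sortedTerms) {
      if (t.indexOf(c) >= 0) n++;
    }
    return n;
  }
}
```

The real FST intersection is smarter than a flat scan, but the shape of the problem is the same: a rare character with a leading wildcard gives the automaton no entry point into the dictionary.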
> >>> If I'm understanding properly, the latest version of your PR essentially
> >>> takes advantage of Mayya's recent change
> >>> (https://github.com/apache/lucene/pull/13454) in the score supplier
> >>> behavior, which now does _some_ up-front work to iterate the first <=
> >>> 16 terms when building the scoreSupplier, and computes a more
> >>> accurate/reasonable cost based on that already-done work. Am I getting
> >>> this right? If so, this seems like it has no downsides and all upside.
> >>>
> >>> I'll do a proper pass through the PR here shortly, but I love the idea
> >>> (assuming I'm understanding it properly on a Friday afternoon after a
> >>> long-ish week...).
> >>>
> >>> Cheers,
> >>> -Greg
> >>>
> >>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller <gsmil...@gmail.com> wrote:
> >>>>
> >>>> Hi Froh-
> >>>>
> >>>> Thanks for raising this, and sorry I missed your tag in GH#13201 back in
> >>>> June (I had some vacation and was generally away). I'd be interested to
> >>>> see what others think as well, but I'll at least commit to looking
> >>>> through your PR tomorrow or Monday to get a better handle on what's
> >>>> being proposed. We went through a few iterations of this originally
> >>>> before we landed on the current version. One promising approach was to
> >>>> have a more intelligent query that would load some number of terms
> >>>> up-front to get a better cost estimate before making a decision, but it
> >>>> required a custom query implementation that generally didn't get
> >>>> favorable feedback (it's nice to be able to use the existing
> >>>> IndexOrDocValuesQuery abstraction instead). I can dig up some of that
> >>>> conversation if it's helpful, but I'll better understand what you've got
> >>>> in mind first.
> >>>>
> >>>> Unwinding a bit, though, I'm also in favor of us doing a better job
> >>>> estimating cost here in general. I think the tricky part is
> >>>> how we go about doing that effectively.
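The cost idea being discussed might be sketched like this (names such as `MtqCostEstimate` and `EXPANSION_LIMIT` are hypothetical, not the actual Lucene implementation): since the score supplier already expands up to 16 terms up-front, a term enum that is exhausted within that budget yields an exact cost (the sum of the expanded terms' doc frequencies), and only the many-terms case falls back to the field-level heuristic.

```java
import java.util.*;

class MtqCostEstimate {
  // Matches the "first 16 terms" budget described in the thread.
  static final int EXPANSION_LIMIT = 16;

  /** Consume up to EXPANSION_LIMIT doc-frequency values. If the enum is
   * exhausted within the budget, the summed docFreqs are an exact cost;
   * otherwise return the caller's conservative heuristic estimate. */
  static long estimate(Iterator<Integer> docFreqs, long heuristicCost) {
    long sum = 0;
    int seen = 0;
    while (docFreqs.hasNext() && seen < EXPANSION_LIMIT) {
      sum += docFreqs.next();
      seen++;
    }
    if (!docFreqs.hasNext()) {
      return sum;  // few terms: exact, and typically tiny
    }
    return heuristicCost;  // many terms: keep the conservative estimate
  }
}
```

Because the expansion work is already paid for when the scoreSupplier is built, reusing it for the cost estimate is what makes the improvement "free".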
> >>>> Thanks again for kicking off this thread!
> >>>>
> >>>> Cheers,
> >>>> -Greg
> >>>>
> >>>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <msf...@gmail.com> wrote:
> >>>>>
> >>>>> Hi there,
> >>>>>
> >>>>> For a few months, some of us have been running into issues with the
> >>>>> cost estimate from AbstractMultiTermQueryConstantScoreWrapper
> >>>>> (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300).
> >>>>>
> >>>>> In https://github.com/apache/lucene/issues/13029, the problem was
> >>>>> raised in terms of queries not being cached because the estimated cost
> >>>>> was too high.
> >>>>>
> >>>>> We've also run into problems in OpenSearch since we started wrapping
> >>>>> MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an exaggerated
> >>>>> cost estimate, so IndexOrDocValuesQuery decides it should be a DV query,
> >>>>> even though the MTQ would really only match a handful of docs (and
> >>>>> should be the lead iterator).
> >>>>>
> >>>>> I opened a PR back in March
> >>>>> (https://github.com/apache/lucene/pull/13201) to try to handle the case
> >>>>> where a MultiTermQuery matches a small number of terms. Since Mayya
> >>>>> pulled the rewrite logic that expands up to 16 terms (to rewrite as a
> >>>>> Boolean disjunction) earlier into the workflow (in
> >>>>> https://github.com/apache/lucene/pull/13454), we get the better cost
> >>>>> estimate for MTQs on few terms "for free".
> >>>>>
> >>>>> What do folks think?
> >>>>>
> >>>>> Thanks,
> >>>>> Froh

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
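For readers outside the thread, the IndexOrDocValuesQuery trade-off Froh describes can be sketched as a simple cost comparison (a hypothetical simplification, not the real lead-cost mechanics): the cheaper side leads, so an exaggerated index-side cost pushes execution onto the doc-values side even when the index query would match only a handful of docs.

```java
/** Toy model of the IndexOrDocValuesQuery decision described above: compare
 * the index query's (estimated) cost against how many docs the rest of the
 * query will visit, and let the cheaper iterator lead. An inflated MTQ cost
 * estimate flips this choice to "docValues" even for a tiny term set. */
class LeadIteratorChoice {
  static String choose(long indexQueryCost, long leadCostFromOtherClauses) {
    return indexQueryCost <= leadCostFromOtherClauses ? "index" : "docValues";
  }
}
```

With the more accurate estimate from the up-front 16-term expansion, an MTQ matching a handful of docs reports a small cost and correctly becomes the lead iterator.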