But actually, Patrick Zhai added support for nondeterministic regexes,
which might help with cases like that. There is this test in
TestRegexpQuery:
/** Test worst-case for getCommonSuffix optimization */
public void testSlowCommonSuffix() throws Exception {
  expectThrows(
      TooComplexToDeterminizeException.class,
      () -> {
        new RegexpQuery(new Term("stringvalue", "(.*a){2000}"));
      });
}
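
For reference, here's a rough, untested sketch of how a caller could
fall back when determinization blows up. The field/pattern names, the
helper method, and the 10x budget are all made up for illustration;
only the RegexpQuery constructors, RegExp.ALL, the Operations constant,
and TooComplexToDeterminizeException are real:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;
import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;

static Query regexpWithFallback(String field, String pattern) {
  try {
    // Default work limit; construction throws if the automaton is
    // too complex to determinize within that budget.
    return new RegexpQuery(new Term(field, pattern));
  } catch (TooComplexToDeterminizeException e) {
    // Retry with 10x the default budget (an arbitrary choice here)
    // before letting the failure propagate to the user.
    return new RegexpQuery(
        new Term(field, pattern),
        RegExp.ALL,
        10 * Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
  }
}
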
On Tue, Aug 6, 2024 at 10:56 AM Michael Sokolov <[email protected]> wrote:
>
> Yes, I think degenerate regexes like *a* are potentially costly.
> Actually, something like *Ⱗ* is probably worse, since it would have to
> scan the entire FST without finding a match (while the FST probably
> has some a's in it, so *a* can terminate early?)
>
> I don't see any way around that aside from: (1) telling the user not
> to do that, or (2) adding some accounting to the FST traversal so it
> can terminate early.
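>
> Something like this is what I mean by (2). It's a hypothetical sketch,
> not real Lucene code: BudgetedTermsEnum is a name I just made up, and
> the real fix would have to hook into the intersect() path rather than
> plain enumeration:
>
> import java.io.IOException;
> import org.apache.lucene.index.FilterLeafReader;
> import org.apache.lucene.index.TermsEnum;
> import org.apache.lucene.util.BytesRef;
>
> /** Caps how many terms a wrapped enum may visit before bailing out. */
> class BudgetedTermsEnum extends FilterLeafReader.FilterTermsEnum {
>   private final long budget;
>   private long visited;
>
>   BudgetedTermsEnum(TermsEnum in, long budget) {
>     super(in);
>     this.budget = budget;
>   }
>
>   @Override
>   public BytesRef next() throws IOException {
>     // Account for every term visited; give up past the budget.
>     if (++visited > budget) {
>       throw new IOException("gave up after visiting " + budget + " terms");
>     }
>     return in.next();
>   }
> }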
>
> On Fri, Aug 2, 2024 at 8:17 PM Michael Froh <[email protected]> wrote:
> >
> > Incidentally, speaking as someone with only a superficial
> > understanding of how the FSTs work, I'm wondering if there's a risk
> > of hidden cost in expanding the first few terms.
> >
> > Say we have a million terms, but only one contains an 'a'. If someone
> > searches for '*a*', does that devolve into a full term scan? Or can
> > the FST efficiently identify an arc with an 'a' and enumerate just
> > the terms containing that arc?
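> >
> > For concreteness, the code path I'm asking about is something like
> > this (my own sketch; "terms" would come from leafReader.terms(field),
> > and Lucene spells the pattern ".*a.*" rather than "*a*"):
> >
> > import java.io.IOException;
> > import org.apache.lucene.index.Terms;
> > import org.apache.lucene.index.TermsEnum;
> > import org.apache.lucene.util.BytesRef;
> > import org.apache.lucene.util.automaton.CompiledAutomaton;
> > import org.apache.lucene.util.automaton.RegExp;
> >
> > static void enumerateMatches(Terms terms) throws IOException {
> >   CompiledAutomaton automaton =
> >       new CompiledAutomaton(new RegExp(".*a.*").toAutomaton());
> >   // intersect() walks the terms FST and the automaton together.
> >   TermsEnum termsEnum = terms.intersect(automaton, null);
> >   BytesRef term;
> >   while ((term = termsEnum.next()) != null) {
> >     // Only terms accepted by the automaton come back; the question
> >     // is how many FST arcs get visited along the way.
> >   }
> > }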
> >
> > Thanks,
> > Froh
> >
> > On Fri, Aug 2, 2024 at 3:50 PM Michael Froh <[email protected]> wrote:
> >>
> >> Exactly!
> >>
> >> My initial implementation added some potential cost. (I think I
> >> enumerated up to 128 terms before giving up.) Now that Mayya has
> >> moved the (probably tiny) cost of expanding the first 16 terms up
> >> front, my change is theoretically "free".
> >>
> >> Froh
> >>
> >> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller <[email protected]> wrote:
> >>>
> >>> Hey Froh-
> >>>
> >>> I got some time to look through your PR (most of that time was spent
> >>> refreshing my memory on the change history leading up to it and
> >>> digesting the issue described). I think this makes a ton of sense. If
> >>> I'm understanding properly, the latest version of your PR essentially
> >>> takes advantage of Mayya's recent change to the score supplier
> >>> behavior (https://github.com/apache/lucene/pull/13454), which now
> >>> does _some_ up-front work to iterate the first <= 16 terms when
> >>> building the scoreSupplier, and computes a more accurate/reasonable
> >>> cost based on that already-done work. Am I getting this right? If so,
> >>> this seems like all upside with no downsides.
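> >>>
> >>> To make sure I've got the shape of it, here's how I'd paraphrase the
> >>> cost logic in code (my own hypothetical sketch; the names and the
> >>> fallback parameter are mine, not the PR's):
> >>>
> >>> import java.io.IOException;
> >>> import org.apache.lucene.index.TermsEnum;
> >>> import org.apache.lucene.util.BytesRef;
> >>>
> >>> static long estimateCost(TermsEnum termsEnum, long fallbackCost)
> >>>     throws IOException {
> >>>   long cost = 0;
> >>>   for (int i = 0; i < 16; i++) {
> >>>     BytesRef term = termsEnum.next();
> >>>     if (term == null) {
> >>>       return cost; // enum exhausted: the cost is exact
> >>>     }
> >>>     cost += termsEnum.docFreq();
> >>>   }
> >>>   return fallbackCost; // more than 16 terms: old heuristic applies
> >>> }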
> >>>
> >>> I'll do a proper pass through the PR here shortly, but I love the idea
> >>> (assuming I'm understanding it properly on a Friday afternoon after a
> >>> long-ish week...).
> >>>
> >>> Cheers,
> >>> -Greg
> >>>
> >>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller <[email protected]> wrote:
> >>>>
> >>>> Hi Froh-
> >>>>
> >>>> Thanks for raising this and sorry I missed your tag in GH#13201 back in
> >>>> June (had some vacation and was generally away). I'd be interested to
> >>>> see what others think as well, but I'll at least commit to looking
> >>>> through your PR tomorrow or Monday to get a better handle on what's
> >>>> being proposed. We went through a few iterations of this originally
> >>>> before we landed on the current version. One promising approach was to
> >>>> have a more intelligent query that would load some number of terms
> >>>> up-front to get a better cost estimate before making a decision, but it
> >>>> required a custom query implementation that generally didn't get
> >>>> favorable feedback (it's nice to be able to use the existing
> >>>> IndexOrDocValuesQuery abstraction instead). I can dig up some of that
> >>>> conversation if it's helpful, but I'll better understand what you've got
> >>>> in mind first.
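> >>>>
> >>>> For anyone following along: that abstraction pairs an index-structure
> >>>> query with a doc-values twin and picks between them at scoring time
> >>>> based on cost. Typical usage looks roughly like this (the "price"
> >>>> field is illustrative and would need to be indexed both as a point
> >>>> and as doc values):
> >>>>
> >>>> import org.apache.lucene.document.LongPoint;
> >>>> import org.apache.lucene.document.SortedNumericDocValuesField;
> >>>> import org.apache.lucene.search.IndexOrDocValuesQuery;
> >>>> import org.apache.lucene.search.Query;
> >>>>
> >>>> static Query priceRange(long lo, long hi) {
> >>>>   return new IndexOrDocValuesQuery(
> >>>>       // index side: points query, cheap when the range is selective
> >>>>       LongPoint.newRangeQuery("price", lo, hi),
> >>>>       // doc-values side: cheap to verify when another clause leads
> >>>>       SortedNumericDocValuesField.newSlowRangeQuery("price", lo, hi));
> >>>> }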
> >>>>
> >>>> Unwinding a bit, though: I'm also in favor, in general, of doing a
> >>>> better job estimating cost here. I think the tricky part is how we
> >>>> go about doing that effectively. Thanks again for kicking off this
> >>>> thread!
> >>>>
> >>>> Cheers,
> >>>> -Greg
> >>>>
> >>>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <[email protected]> wrote:
> >>>>>
> >>>>> Hi there,
> >>>>>
> >>>>> For a few months, some of us have been running into issues with the
> >>>>> cost estimate from AbstractMultiTermQueryConstantScoreWrapper.
> >>>>> (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300)
> >>>>>
> >>>>> In https://github.com/apache/lucene/issues/13029, the problem was
> >>>>> raised in terms of queries not being cached, because the estimated cost
> >>>>> was too high.
> >>>>>
> >>>>> We've also run into problems in OpenSearch since we started wrapping
> >>>>> MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an
> >>>>> exaggerated cost estimate, so IndexOrDocValuesQuery decides it
> >>>>> should run as a DV query, even though the MTQ would really only
> >>>>> match a handful of docs (and should be the lead iterator).
> >>>>>
> >>>>> I opened a PR back in March
> >>>>> (https://github.com/apache/lucene/pull/13201) to try to handle the case
> >>>>> where a MultiTermQuery matches a small number of terms. Since Mayya
> >>>>> pulled the rewrite logic that expands up to 16 terms (to rewrite as a
> >>>>> Boolean disjunction) earlier in the workflow (in
> >>>>> https://github.com/apache/lucene/pull/13454), we get the better cost
> >>>>> estimate for MTQs on few terms "for free".
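> >>>>>
> >>>>> In code, the idea is roughly this (a hypothetical sketch with
> >>>>> made-up names, not the actual PR code): expand the enum, and if it
> >>>>> finishes within 16 terms, build the disjunction directly.
> >>>>>
> >>>>> import java.io.IOException;
> >>>>> import org.apache.lucene.index.Term;
> >>>>> import org.apache.lucene.index.TermsEnum;
> >>>>> import org.apache.lucene.search.BooleanClause;
> >>>>> import org.apache.lucene.search.BooleanQuery;
> >>>>> import org.apache.lucene.search.Query;
> >>>>> import org.apache.lucene.search.TermQuery;
> >>>>> import org.apache.lucene.util.BytesRef;
> >>>>>
> >>>>> static Query rewriteIfFewTerms(String field, TermsEnum termsEnum)
> >>>>>     throws IOException {
> >>>>>   BooleanQuery.Builder builder = new BooleanQuery.Builder();
> >>>>>   int count = 0;
> >>>>>   BytesRef term;
> >>>>>   while ((term = termsEnum.next()) != null) {
> >>>>>     if (++count > 16) {
> >>>>>       return null; // too many terms; caller keeps the original MTQ
> >>>>>     }
> >>>>>     builder.add(
> >>>>>         new TermQuery(new Term(field, BytesRef.deepCopyOf(term))),
> >>>>>         BooleanClause.Occur.SHOULD);
> >>>>>   }
> >>>>>   return builder.build(); // <= 16 terms: cheap, accurate disjunction
> >>>>> }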
> >>>>>
> >>>>> What do folks think?
> >>>>>
> >>>>> Thanks,
> >>>>> Froh