But actually Patrick Zhai added support for nondeterministic regexes that might help with cases like that? There is this in TestRegexpQuery:
/** Test worst-case for getCommonSuffix optimization */
public void testSlowCommonSuffix() throws Exception {
  expectThrows(
      TooComplexToDeterminizeException.class,
      () -> {
        new RegexpQuery(new Term("stringvalue", "(.*a){2000}"));
      });
}

On Tue, Aug 6, 2024 at 10:56 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> Yes, I think degenerate regexes like *a* are potentially costly.
> Actually something like *Ⱗ* is probably worse, since it would need
> to scan the entire FST (which probably has some a's in it?)
>
> I don't see any way around that aside from: (1) telling the user not to do
> that, or (2) putting some accounting on the FST so it can early-terminate.
>
> On Fri, Aug 2, 2024 at 8:17 PM Michael Froh <msf...@gmail.com> wrote:
> >
> > Incidentally, speaking as someone with only a superficial understanding of
> > how the FSTs work, I'm wondering if there is a risk of cost in expanding the
> > first few terms.
> >
> > Say we have a million terms, but only one contains an 'a'. If someone
> > searches for '*a*', does that devolve into a term scan? Or can the FST
> > efficiently identify an arc with an 'a' and efficiently identify terms
> > containing that arc?
> >
> > Thanks,
> > Froh
> >
> > On Fri, Aug 2, 2024 at 3:50 PM Michael Froh <msf...@gmail.com> wrote:
> >>
> >> Exactly!
> >>
> >> My initial implementation added some potential cost. (I think I enumerated
> >> up to 128 terms before giving up.) Now that Mayya has moved the (probably
> >> tiny) cost of expanding the first 16 terms upfront, my change is
> >> theoretically "free".
> >>
> >> Froh
> >>
> >> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller <gsmil...@gmail.com> wrote:
> >>>
> >>> Hey Froh-
> >>>
> >>> I got some time to look through your PR (most of the time was actually
> >>> spent refreshing my memory on the change history leading up to your PR
> >>> and digesting the issue described). I think this makes a ton of sense.
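Froh's '*a*' question above can be illustrated with a toy model (the class and method names here are hypothetical, not Lucene's FST or terms-enum API): a sorted term dictionary lets a prefix query seek directly to its block, while a leading-wildcard "contains" query offers no prefix to seek on, so in the worst case every term gets inspected.

```java
import java.util.*;

class TermScanCost {
  /** Prefix query (e.g. abc*): the sorted dictionary supports seeking
   * straight to the matching block, so only matching terms are visited. */
  static int countPrefix(List<String> sortedTerms, String prefix) {
    int lo = Collections.binarySearch(sortedTerms, prefix);
    int i = lo >= 0 ? lo : -lo - 1;  // first term >= prefix
    int n = 0;
    while (i < sortedTerms.size() && sortedTerms.get(i).startsWith(prefix)) {
      n++;
      i++;
    }
    return n;
  }

  /** "Contains" query (e.g. *a*): there is no usable prefix, so this toy
   * version inspects every term, mirroring the degenerate-scan concern. */
  static int countContains(List<String> sortedTerms, char c) {
    int n = 0;
    for (String t : sortedTerms) {
      if (t.indexOf(c) >= 0) n++;
    }
    return n;
  }
}
```

The real FST intersection is smarter than a flat scan, but the shape of the problem is the same: a rare character with a leading wildcard gives the automaton no entry point into the dictionary.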
> >>> If I'm understanding properly, the latest version of your PR essentially
> >>> takes advantage of Mayya's recent change
> >>> (https://github.com/apache/lucene/pull/13454) in the score supplier
> >>> behavior, which now does _some_ up-front work to iterate the first <=
> >>> 16 terms when building the scoreSupplier, and computes a more
> >>> accurate/reasonable cost based on that already-done work. Am I getting
> >>> this right? If so, this seems like it has no downsides and all upside.
> >>>
> >>> I'll do a proper pass through the PR here shortly, but I love the idea
> >>> (assuming I'm understanding it properly on a Friday afternoon after a
> >>> long-ish week...).
> >>>
> >>> Cheers,
> >>> -Greg
> >>>
> >>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller <gsmil...@gmail.com> wrote:
> >>>>
> >>>> Hi Froh-
> >>>>
> >>>> Thanks for raising this, and sorry I missed your tag in GH#13201 back in
> >>>> June (I had some vacation and was generally away). I'd be interested to
> >>>> see what others think as well, but I'll at least commit to looking
> >>>> through your PR tomorrow or Monday to get a better handle on what's
> >>>> being proposed. We went through a few iterations of this originally
> >>>> before we landed on the current version. One promising approach was to
> >>>> have a more intelligent query that would load some number of terms
> >>>> up-front to get a better cost estimate before making a decision, but it
> >>>> required a custom query implementation that generally didn't get
> >>>> favorable feedback (it's nice to be able to use the existing
> >>>> IndexOrDocValuesQuery abstraction instead). I can dig up some of that
> >>>> conversation if it's helpful, but I'll better understand what you've got
> >>>> in mind first.
> >>>>
> >>>> Unwinding a bit, though, I'm also in favor of us doing a better job
> >>>> estimating cost here in general. I think the tricky part is
> >>>> how we go about doing that effectively.
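The cost idea being discussed might be sketched like this (names such as `MtqCostEstimate` and `EXPANSION_LIMIT` are hypothetical, not the actual Lucene implementation): since the score supplier already expands up to 16 terms up-front, a term enum that is exhausted within that budget yields an exact cost (the sum of the expanded terms' doc frequencies), and only the many-terms case falls back to the field-level heuristic.

```java
import java.util.*;

class MtqCostEstimate {
  // Matches the "first 16 terms" budget described in the thread.
  static final int EXPANSION_LIMIT = 16;

  /** Consume up to EXPANSION_LIMIT doc-frequency values. If the enum is
   * exhausted within the budget, the summed docFreqs are an exact cost;
   * otherwise return the caller's conservative heuristic estimate. */
  static long estimate(Iterator<Integer> docFreqs, long heuristicCost) {
    long sum = 0;
    int seen = 0;
    while (docFreqs.hasNext() && seen < EXPANSION_LIMIT) {
      sum += docFreqs.next();
      seen++;
    }
    if (!docFreqs.hasNext()) {
      return sum;  // few terms: exact, and typically tiny
    }
    return heuristicCost;  // many terms: keep the conservative estimate
  }
}
```

Because the expansion work is already paid for when the scoreSupplier is built, reusing it for the cost estimate is what makes the improvement "free".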
> >>>> Thanks again for kicking off this thread!
> >>>>
> >>>> Cheers,
> >>>> -Greg
> >>>>
> >>>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <msf...@gmail.com> wrote:
> >>>>>
> >>>>> Hi there,
> >>>>>
> >>>>> For a few months, some of us have been running into issues with the
> >>>>> cost estimate from AbstractMultiTermQueryConstantScoreWrapper
> >>>>> (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300).
> >>>>>
> >>>>> In https://github.com/apache/lucene/issues/13029, the problem was
> >>>>> raised in terms of queries not being cached because the estimated cost
> >>>>> was too high.
> >>>>>
> >>>>> We've also run into problems in OpenSearch since we started wrapping
> >>>>> MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an exaggerated
> >>>>> cost estimate, so IndexOrDocValuesQuery decides it should be a DV query,
> >>>>> even though the MTQ would really only match a handful of docs (and
> >>>>> should be the lead iterator).
> >>>>>
> >>>>> I opened a PR back in March
> >>>>> (https://github.com/apache/lucene/pull/13201) to try to handle the case
> >>>>> where a MultiTermQuery matches a small number of terms. Since Mayya
> >>>>> pulled the rewrite logic that expands up to 16 terms (to rewrite as a
> >>>>> Boolean disjunction) earlier into the workflow (in
> >>>>> https://github.com/apache/lucene/pull/13454), we get the better cost
> >>>>> estimate for MTQs on few terms "for free".
> >>>>>
> >>>>> What do folks think?
> >>>>>
> >>>>> Thanks,
> >>>>> Froh

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
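For readers outside the thread, the IndexOrDocValuesQuery trade-off Froh describes can be sketched as a simple cost comparison (a hypothetical simplification, not the real lead-cost mechanics): the cheaper side leads, so an exaggerated index-side cost pushes execution onto the doc-values side even when the index query would match only a handful of docs.

```java
/** Toy model of the IndexOrDocValuesQuery decision described above: compare
 * the index query's (estimated) cost against how many docs the rest of the
 * query will visit, and let the cheaper iterator lead. An inflated MTQ cost
 * estimate flips this choice to "docValues" even for a tiny term set. */
class LeadIteratorChoice {
  static String choose(long indexQueryCost, long leadCostFromOtherClauses) {
    return indexQueryCost <= leadCostFromOtherClauses ? "index" : "docValues";
  }
}
```

With the more accurate estimate from the up-front 16-term expansion, an MTQ matching a handful of docs reports a small cost and correctly becomes the lead iterator.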