Yes, I think degenerate regexes like *a* are potentially costly. Actually, something like *Ⱗ* is probably worse, since it would need to scan the entire FST (which probably has some a's in it?)
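To make the "scan the entire FST" concern concrete, here's a minimal sketch in plain Java (a toy sorted term dictionary with invented names, not Lucene's actual FST code) contrasting a prefix pattern, which sorted term storage can answer from a bounded range, with a contains pattern like *a*, which in the worst case must visit every term:

```java
import java.util.Arrays;

// Toy model of a sorted term dictionary (an FST stores terms in sorted
// order, so prefix lookups are cheap). Hypothetical sketch, not Lucene code.
public class TermScanSketch {

    // Prefix query: binary-search the lower bound, then walk forward only
    // while terms still carry the prefix. Cost is O(log n + matches).
    static int countPrefix(String[] sortedTerms, String prefix) {
        int lo = Arrays.binarySearch(sortedTerms, prefix);
        if (lo < 0) lo = -lo - 1; // not found: use the insertion point
        int count = 0;
        for (int i = lo; i < sortedTerms.length && sortedTerms[i].startsWith(prefix); i++) {
            count++;
        }
        return count;
    }

    // "Contains" query (*a*): sorted order gives no usable bound, so every
    // term must be visited even if only one matches.
    // Returns {matches, termsVisited}.
    static int[] countContains(String[] sortedTerms, String needle) {
        int matches = 0, visited = 0;
        for (String t : sortedTerms) {
            visited++;
            if (t.contains(needle)) matches++;
        }
        return new int[] {matches, visited};
    }
}
```

With a million terms of which only one contains an 'a', the second method still visits all million — that is the "devolves into a term scan" case.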
I don't see any way around that aside from: (1) telling the user "don't do that", or (2) putting some accounting on the FST so it can early-terminate.

On Fri, Aug 2, 2024 at 8:17 PM Michael Froh <msf...@gmail.com> wrote:
>
> Incidentally, speaking as someone with only a superficial understanding of
> how the FSTs work, I'm wondering if there is a risk of cost in expanding
> the first few terms.
>
> Say we have a million terms, but only one contains an 'a'. If someone
> searches for '*a*', does that devolve into a term scan? Or can the FST
> efficiently identify an arc with an 'a' and efficiently identify terms
> containing that arc?
>
> Thanks,
> Froh
>
> On Fri, Aug 2, 2024 at 3:50 PM Michael Froh <msf...@gmail.com> wrote:
>>
>> Exactly!
>>
>> My initial implementation added some potential cost. (I think I
>> enumerated up to 128 terms before giving up.) Now that Mayya has moved
>> the (probably tiny) cost of expanding the first 16 terms upfront, my
>> change is theoretically "free".
>>
>> Froh
>>
>> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller <gsmil...@gmail.com> wrote:
>>>
>>> Hey Froh-
>>>
>>> I got some time to look through your PR (most of the time was actually
>>> spent refreshing my memory on the change history leading up to your PR
>>> and digesting the issue described). I think this makes a ton of sense.
>>> If I'm understanding properly, the latest version of your PR
>>> essentially takes advantage of Mayya's recent change
>>> (https://github.com/apache/lucene/pull/13454) to the score supplier
>>> behavior, which now does _some_ up-front work to iterate the first
>>> <= 16 terms when building the scoreSupplier, and computes a more
>>> accurate/reasonable cost based on that already-done work. Am I getting
>>> this right? If so, this seems like it has no downsides and all upside.
>>>
>>> I'll do a proper pass through the PR here shortly, but I love the idea
>>> (assuming I'm understanding it properly on a Friday afternoon after a
>>> long-ish week...).
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller <gsmil...@gmail.com> wrote:
>>>>
>>>> Hi Froh-
>>>>
>>>> Thanks for raising this, and sorry I missed your tag in GH#13201 back
>>>> in June (had some vacation and was generally away). I'd be interested
>>>> to see what others think as well, but I'll at least commit to looking
>>>> through your PR tomorrow or Monday to get a better handle on what's
>>>> being proposed. We went through a few iterations of this originally
>>>> before we landed on the current version. One promising approach was
>>>> to have a more intelligent query that would load some number of terms
>>>> up-front to get a better cost estimate before making a decision, but
>>>> it required a custom query implementation that generally didn't get
>>>> favorable feedback (it's nice to be able to use the existing
>>>> IndexOrDocValuesQuery abstraction instead). I can dig up some of that
>>>> conversation if it's helpful, but I'll better understand what you've
>>>> got in mind first.
>>>>
>>>> Unwinding a bit, though, I'm also generally in favor of doing a
>>>> better job estimating cost here. I think the tricky part is how we go
>>>> about doing that effectively. Thanks again for kicking off this
>>>> thread!
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <msf...@gmail.com> wrote:
>>>>>
>>>>> Hi there,
>>>>>
>>>>> For a few months, some of us have been running into issues with the
>>>>> cost estimate from AbstractMultiTermQueryConstantScoreWrapper.
>>>>> (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300)
>>>>>
>>>>> In https://github.com/apache/lucene/issues/13029, the problem was
>>>>> raised in terms of queries not being cached, because the estimated
>>>>> cost was too high.
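The cost-estimation idea discussed in this thread can be sketched roughly like this (invented names and simplified logic; the real code lives in AbstractMultiTermQueryConstantScoreWrapper and is more involved): expand up to 16 terms while building the scorer supplier, and if the term enumeration is exhausted within that budget, the sum of the expanded terms' docFreqs is an exact upper bound on cost; otherwise fall back to a coarser field-level estimate:

```java
import java.util.Iterator;

// Rough sketch (invented names) of the cost-estimation idea: expand up to
// 16 terms up front. If the enumeration finishes within that budget, the
// sum of the expanded terms' docFreqs bounds the cost exactly; otherwise
// fall back to a pessimistic field-level estimate.
public class CostEstimateSketch {
    static final int TERM_EXPANSION_LIMIT = 16;

    static long estimateCost(Iterator<Integer> docFreqs, long fieldLevelEstimate) {
        long sum = 0;
        int expanded = 0;
        while (docFreqs.hasNext() && expanded < TERM_EXPANSION_LIMIT) {
            sum += docFreqs.next();
            expanded++;
        }
        if (!docFreqs.hasNext()) {
            // Few terms: the up-front expansion already gave the exact cost.
            return sum;
        }
        // Many terms: keep the coarse estimate.
        return fieldLevelEstimate;
    }
}
```

The point of the thread is that once the ≤16-term expansion happens up front anyway (per PR 13454), the accurate small-term-count estimate comes "for free".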
>>>>>
>>>>> We've also run into problems in OpenSearch, since we started
>>>>> wrapping MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an
>>>>> exaggerated cost estimate, so IndexOrDocValuesQuery decides it
>>>>> should be a DV query, even though the MTQ would really only match a
>>>>> handful of docs (and should be the lead iterator).
>>>>>
>>>>> I opened a PR back in March
>>>>> (https://github.com/apache/lucene/pull/13201) to try to handle the
>>>>> case where a MultiTermQuery matches a small number of terms. Since
>>>>> Mayya pulled the rewrite logic that expands up to 16 terms (to
>>>>> rewrite as a Boolean disjunction) earlier into the workflow (in
>>>>> https://github.com/apache/lucene/pull/13454), we get the better cost
>>>>> estimate for MTQs on few terms "for free".
>>>>>
>>>>> What do folks think?
>>>>>
>>>>> Thanks,
>>>>> Froh

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
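For readers less familiar with the IndexOrDocValuesQuery trade-off Froh describes above, it can be caricatured as follows (a deliberately simplified sketch with invented names; the real class weighs costs through scorer suppliers): the index-based query is worth leading with only when its estimated cost is small relative to the lead cost of the rest of the conjunction, so an inflated MTQ cost estimate flips the decision toward doc values even when the MTQ actually matches few docs:

```java
// Deliberately simplified sketch (invented names) of the decision
// IndexOrDocValuesQuery makes: lead with the index query only when its
// estimated cost is no larger than the lead cost of the rest of the
// conjunction; otherwise verify candidates with doc values instead.
public class IndexOrDocValuesSketch {
    static String chooseScorer(long indexQueryCostEstimate, long leadCost) {
        return indexQueryCostEstimate <= leadCost ? "index" : "docvalues";
    }
}
```

Under this caricature, an MTQ that truly matches a handful of docs but reports a huge estimated cost gets routed to the doc-values side, which is exactly the misbehavior the proposed cost fix addresses.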