Incidentally, speaking as someone with only a superficial understanding of how the FSTs work, I'm wondering if there is a risk of cost in expanding the first few terms. Say we have a million terms, but only one contains an 'a'. If someone searches for '*a*', does that devolve into a full term scan? Or can the FST efficiently identify an arc with an 'a' and efficiently identify the terms containing that arc?
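To make the question concrete, here is a rough, self-contained sketch of how I understand the term-dictionary side: it compiles a ".*a.*" automaton (standing in for '*a*'), intersects it with the field's terms via Terms.intersect(), and expands only the first 16 matching terms, summing docFreq() as a cheap cost bound. The class name, field name, sample values, and the 16-term cutoff are all made up for illustration; this is not the actual AbstractMultiTermQueryConstantScoreWrapper code, and I may well be glossing over how the FST arcs are actually walked.

import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.CompiledAutomaton;
import org.apache.lucene.util.automaton.RegExp;

public class ExpandFirstTermsSketch {
  public static void main(String[] args) throws IOException {
    try (Directory dir = new ByteBuffersDirectory()) {
      // Tiny example index: four terms, two of which contain an 'a'.
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        for (String value : List.of("foo", "bar", "baz", "qux")) {
          Document doc = new Document();
          doc.add(new StringField("field", value, Field.Store.NO));
          writer.addDocument(doc);
        }
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        // Automaton for "terms containing an 'a'", roughly what a *a* wildcard builds.
        CompiledAutomaton automaton = new CompiledAutomaton(new RegExp(".*a.*").toAutomaton());
        for (LeafReaderContext leaf : reader.leaves()) {
          Terms terms = leaf.reader().terms("field");
          if (terms == null) {
            continue;
          }
          // intersect() only returns terms accepted by the automaton; how much of the
          // dictionary it has to visit internally is exactly the open question above.
          TermsEnum matching = terms.intersect(automaton, null);
          long costUpperBound = 0;
          int expanded = 0;
          for (BytesRef term = matching.next();
               term != null && expanded < 16;
               term = matching.next(), expanded++) {
            costUpperBound += matching.docFreq();  // sum docFreq of the expanded terms
            System.out.println("matched term: " + term.utf8ToString());
          }
          System.out.println("cost bound from first " + expanded + " terms: " + costUpperBound);
        }
      }
    }
  }
}

Whether intersect() can actually skip efficiently for a leading-wildcard pattern like this is the part I'm unsure about.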
Thanks,
Froh

On Fri, Aug 2, 2024 at 3:50 PM Michael Froh <msf...@gmail.com> wrote:

> Exactly!
>
> My initial implementation added some potential cost. (I think I enumerated up to 128 terms before giving up.) Now that Mayya has moved the (probably tiny) cost of expanding the first 16 terms up front, my change is theoretically "free".
>
> Froh
>
> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller <gsmil...@gmail.com> wrote:
>
>> Hey Froh-
>>
>> I got some time to look through your PR (most of the time was actually spent refreshing my memory on the change history leading up to your PR and digesting the issue described). I think this makes a ton of sense. If I'm understanding properly, the latest version of your PR essentially takes advantage of Mayya's recent change (https://github.com/apache/lucene/pull/13454) to the scorer supplier behavior, which now does _some_ up-front work to iterate the first <= 16 terms when building the scorerSupplier, and computes a more accurate/reasonable cost based on that already-done work. Am I getting this right? If so, this seems like it has no downsides and all upside.
>>
>> I'll do a proper pass through the PR here shortly, but I love the idea (assuming I'm understanding it properly on a Friday afternoon after a long-ish week...).
>>
>> Cheers,
>> -Greg
>>
>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller <gsmil...@gmail.com> wrote:
>>
>>> Hi Froh-
>>>
>>> Thanks for raising this, and sorry I missed your tag in GH#13201 back in June (I had some vacation and was generally away). I'd be interested to see what others think as well, but I'll at least commit to looking through your PR tomorrow or Monday to get a better handle on what's being proposed. We went through a few iterations of this originally before we landed on the current version. One promising approach was to have a more intelligent query that would load some number of terms up front to get a better cost estimate before making a decision, but it required a custom query implementation that generally didn't get favorable feedback (it's nice to be able to use the existing IndexOrDocValuesQuery abstraction instead). I can dig up some of that conversation if it's helpful, but I'll better understand what you've got in mind first.
>>>
>>> Unwinding a bit, though: I also agree in general that we should be able to do a better job estimating cost here. I think the tricky part is how we go about doing that effectively. Thanks again for kicking off this thread!
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <msf...@gmail.com> wrote:
>>>
>>>> Hi there,
>>>>
>>>> For a few months, some of us have been running into issues with the cost estimate from AbstractMultiTermQueryConstantScoreWrapper (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300).
>>>>
>>>> In https://github.com/apache/lucene/issues/13029, the problem was raised in terms of queries not being cached because the estimated cost was too high.
>>>>
>>>> We've also run into problems in OpenSearch since we started wrapping MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an exaggerated cost estimate, so IndexOrDocValuesQuery decides it should be a DV query, even though the MTQ would really only match a handful of docs (and should be the lead iterator).
>>>>
>>>> I opened a PR back in March (https://github.com/apache/lucene/pull/13201) to try to handle the case where a MultiTermQuery matches a small number of terms. Since Mayya pulled the rewrite logic that expands up to 16 terms (to rewrite as a Boolean disjunction) earlier into the workflow (in https://github.com/apache/lucene/pull/13454), we get the better cost estimate for MTQs on few terms "for free".
>>>>
>>>> What do folks think?
>>>>
>>>> Thanks,
>>>> Froh
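P.S. For anyone skimming the quoted thread above, here is a minimal sketch of the IndexOrDocValuesQuery wrapping pattern being described, using a TermRangeQuery (a MultiTermQuery) on the index side and a doc-values range query on the other. The field name and range bounds are invented for illustration; this is not the exact OpenSearch code.

import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.util.BytesRef;

public class IndexOrDocValuesSketch {
  public static void main(String[] args) {
    // Index-side query: a MultiTermQuery, so its cost comes from
    // AbstractMultiTermQueryConstantScoreWrapper's estimate.
    Query indexQuery = TermRangeQuery.newStringRange("status", "error_aaa", "error_zzz", true, true);

    // Doc-values-side query over the same field and range.
    Query dvQuery = SortedSetDocValuesField.newSlowRangeQuery(
        "status", new BytesRef("error_aaa"), new BytesRef("error_zzz"), true, true);

    // At search time, IndexOrDocValuesQuery compares the index-side ScorerSupplier's
    // cost() against the lead cost of the other clauses. If the MTQ's cost is wildly
    // overestimated, the doc-values side gets picked even when the MTQ would really
    // only match a handful of docs and could have been the lead iterator.
    Query query = new IndexOrDocValuesQuery(indexQuery, dvQuery);
    System.out.println(query);
  }
}

Since the whole decision hinges on the index-side cost() estimate, a tighter estimate for MTQs that expand to only a few terms is what keeps queries like this on the index path.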