Exactly! My initial implementation added some potential cost. (I think I enumerated up to 128 terms before giving up.) Now that Mayya's change pays the (probably tiny) cost of expanding the first 16 terms upfront, my change is theoretically "free".
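To make the idea concrete, here's a rough sketch of the kind of bounded term enumeration I'm describing. It's illustrative only -- the method name, the termLimit parameter, and the fallback to sumDocFreq are made up for this email; the real logic lives in AbstractMultiTermQueryConstantScoreWrapper:

    import java.io.IOException;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;

    // Illustrative sketch: sum docFreq over at most `termLimit` matching terms.
    // If the enumeration finishes within the limit, the sum is a tight cost
    // estimate; otherwise fall back to a pessimistic field-wide estimate.
    static long estimateCost(Terms terms, TermsEnum matchingTerms, int termLimit)
        throws IOException {
      long cost = 0;
      int seen = 0;
      while (matchingTerms.next() != null) {
        if (++seen > termLimit) {
          // Too many matching terms to enumerate cheaply; assume the worst.
          return terms.getSumDocFreq();
        }
        cost += matchingTerms.docFreq();
      }
      return cost; // few terms matched, so this bound is accurate
    }

In my original PR the limit was (I think) 128; with Mayya's change the first 16 terms are already expanded by the time we need a cost, so we don't have to pay for any extra enumeration at all.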
Froh

On Fri, Aug 2, 2024 at 3:25 PM Greg Miller <gsmil...@gmail.com> wrote:

> Hey Froh-
>
> I got some time to look through your PR (most of the time was actually refreshing my memory on the change history leading up to your PR and digesting the issue described). I think this makes a ton of sense. If I'm understanding properly, the latest version of your PR essentially takes advantage of Mayya's recent change (https://github.com/apache/lucene/pull/13454) to the score supplier behavior, which now does _some_ up-front work to iterate the first <= 16 terms when building the scoreSupplier, and computes a more accurate/reasonable cost based on that already-done work. Am I getting this right? If so, this seems like it has no downsides and all upside.
>
> I'll do a proper pass through the PR here shortly, but I love the idea (assuming I'm understanding it properly on a Friday afternoon after a long-ish week...).
>
> Cheers,
> -Greg
>
> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller <gsmil...@gmail.com> wrote:
>
>> Hi Froh-
>>
>> Thanks for raising this, and sorry I missed your tag in GH#13201 back in June (I had some vacation and was generally away). I'd be interested to see what others think as well, but I'll at least commit to looking through your PR tomorrow or Monday to get a better handle on what's being proposed. We went through a few iterations of this originally before we landed on the current version. One promising approach was to have a more intelligent query that would load some number of terms up-front to get a better cost estimate before making a decision, but it required a custom query implementation that generally didn't get favorable feedback (it's nice to be able to use the existing IndexOrDocValuesQuery abstraction instead). I can dig up some of that conversation if it's helpful, but I'll better understand what you've got in mind first.
>>
>> Unwinding a bit, though: I'm also in favor, in general, of doing a better job estimating cost here. I think the tricky part is how we go about doing that effectively. Thanks again for kicking off this thread!
>>
>> Cheers,
>> -Greg
>>
>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <msf...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> For a few months, some of us have been running into issues with the cost estimate from AbstractMultiTermQueryConstantScoreWrapper (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300).
>>>
>>> In https://github.com/apache/lucene/issues/13029, the problem was raised in terms of queries not being cached because the estimated cost was too high.
>>>
>>> We've also run into problems in OpenSearch since we started wrapping MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an exaggerated cost estimate, so IndexOrDocValuesQuery decides it should be a DV query, even though the MTQ would really only match a handful of docs (and should be the lead iterator).
>>>
>>> I opened a PR back in March (https://github.com/apache/lucene/pull/13201) to try to handle the case where a MultiTermQuery matches a small number of terms. Since Mayya pulled the rewrite logic that expands up to 16 terms (to rewrite as a Boolean disjunction) earlier in the workflow (in https://github.com/apache/lucene/pull/13454), we get the better cost estimate for MTQs on few terms "for free".
>>>
>>> What do folks think?
>>>
>>> Thanks,
>>> Froh