Incidentally, speaking as someone with only a superficial understanding of
how the FSTs work, I'm wondering if there is risk of cost in expanding the
first few terms.

Say we have a million terms, but only one contains an 'a'. If someone
searches for '*a*', does that devolve into a term scan? Or can the FST
efficiently identify an arc with an 'a' and efficiently identify terms
containing that arc?

Thanks,
Froh

On Fri, Aug 2, 2024 at 3:50 PM Michael Froh <msf...@gmail.com> wrote:

> Exactly!
>
> My initial implementation added some potential cost. (I think I enumerated
> up to 128 terms before giving up.) Now that Mayya moved the (probably tiny)
> cost of expanding the first 16 terms upfront, my change is theoretically
> "free".
>
> Froh
>
> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller <gsmil...@gmail.com> wrote:
>
>> Hey Froh-
>>
>> I got some time to look through your PR (most of the time was actually
>> refreshing my memory on the change history leading up to your PR and
>> digesting the issue described). I think this makes a ton of sense. If I'm
>> understanding properly, the latest version of your PR essentially takes
>> advantage of Mayya's recent change (
>> https://github.com/apache/lucene/pull/13454) in the score supplier
>> behavior that is now doing _some_ up-front work to iterate the first <= 16
>> terms when building the scoreSupplier and computes a more
>> accurate/reasonable cost based on that already-done work. Am I getting this
>> right? If so, this seems like it has no downsides and all upside.
>>
>> I'll do a proper pass through the PR here shortly, but I love the idea
>> (assuming I'm understanding it properly on a Friday afternoon after a
>> long-ish week...).
>>
>> Cheers,
>> -Greg
>>
>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller <gsmil...@gmail.com> wrote:
>>
>>> Hi Froh-
>>>
>>> Thanks for raising this and sorry I missed your tag in GH#13201 back in
>>> June (had some vacation and was generally away). I'd be interested to see
>>> what others think as well, but I'll at least commit to looking through your
>>> PR tomorrow or Monday to get a better handle on what's being proposed. We
>>> went through a few iterations of this originally before we landed on the
>>> current version. One promising approach was to have a more intelligent
>>> query that would load some number of terms up-front to get a better cost
>>> estimate before making a decision, but it required a custom query
>>> implementation that generally didn't get favorable feedback (it's nice to
>>> be able to use the existing IndexOrDocValuesQuery abstraction instead). I
>>> can dig up some of that conversation if it's helpful, but I'll better
>>> understand what you've got in mind first.
>>>
>>> Unwinding a bit though, I'm also in favor in general that we should be
>>> able to do a better job estimating cost here. I think the tricky part is
>>> how we go about doing that effectively. Thanks again for kicking off this
>>> thread!
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <msf...@gmail.com> wrote:
>>>
>>>> Hi there,
>>>>
>>>> For a few months, some of us have been running into issues with the
>>>> cost estimate from AbstractMultiTermQueryConstantScoreWrapper. (
>>>> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300
>>>> )
>>>>
>>>> In https://github.com/apache/lucene/issues/13029, the problem was
>>>> raised in terms of queries not being cached, because the estimated cost was
>>>> too high.
>>>>
>>>> We've also run into problems in OpenSearch, since we started wrapping
>>>> MultiTermQueries in IndexOrDocValueQuery. The MTQ gets an exaggerated cost
>>>> estimate, so IndexOrDocValueQuery decides it should be a DV query, even
>>>> though the MTQ would really only match a handful of docs (and should be
>>>> lead iterator).
>>>>
>>>> I opened a PR back in March (
>>>> https://github.com/apache/lucene/pull/13201) to try to handle the case
>>>> where a MultiTermQuery matches a small number of terms. Since Mayya pulled
>>>> the rewrite logic that expands up to 16 terms (to rewrite as a Boolean
>>>> disjunction) earlier in the workflow (in
>>>> https://github.com/apache/lucene/pull/13454), we get the better cost
>>>> estimate for MTQs on few terms "for free".
>>>>
>>>> What do folks think?
>>>>
>>>> Thanks,
>>>> Froh
>>>>
>>>

Reply via email to