Yes, I think degenerate regexes like *a* are potentially costly.
Actually, something like *Ⱗ* is probably worse, since it would need
to scan the entire FST (which probably has at least some a's in it?)

I don't see any way around that aside from: (1) telling the user not to do
that, or (2) putting some accounting on the FST so it can early-terminate.
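The accounting idea in (2) could look something like this toy sketch. It is plain Java over a sorted term list, not Lucene's actual FST/automaton intersection API; the class name, budget mechanism, and predicate are all illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Toy sketch (not Lucene's FST API): scan a sorted term list collecting
// terms that match a predicate, but count every entry "visited" and bail
// out once a budget is exhausted instead of scanning the whole dictionary.
class BudgetedScan {
    // Returns the matching terms, or Optional.empty() if the scan
    // early-terminated because it visited more than `budget` entries.
    static Optional<List<String>> collect(List<String> sortedTerms,
                                          Predicate<String> matches,
                                          int budget) {
        List<String> hits = new ArrayList<>();
        int visited = 0;
        for (String term : sortedTerms) {
            if (++visited > budget) {
                return Optional.empty(); // too costly: give up early
            }
            if (matches.test(term)) {
                hits.add(term);
            }
        }
        return Optional.of(hits);
    }
}
```

For a pattern like *Ⱗ* over a dictionary with no Ⱗ's, the predicate never matches, so without the budget the scan would visit every term before concluding there are no hits.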

On Fri, Aug 2, 2024 at 8:17 PM Michael Froh <msf...@gmail.com> wrote:
>
> Incidentally, speaking as someone with only a superficial understanding of 
> how the FSTs work, I'm wondering whether there's a risk that expanding the 
> first few terms is itself costly.
>
> Say we have a million terms, but only one contains an 'a'. If someone 
> searches for '*a*', does that devolve into a term scan? Or can the FST 
> efficiently identify an arc with an 'a' and efficiently identify terms 
> containing that arc?
>
> Thanks,
> Froh
>
> On Fri, Aug 2, 2024 at 3:50 PM Michael Froh <msf...@gmail.com> wrote:
>>
>> Exactly!
>>
>> My initial implementation added some potential cost. (I think I enumerated 
>> up to 128 terms before giving up.) Now that Mayya moved the (probably tiny) 
>> cost of expanding the first 16 terms upfront, my change is theoretically 
>> "free".
>>
>> Froh
>>
>> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller <gsmil...@gmail.com> wrote:
>>>
>>> Hey Froh-
>>>
>>> I got some time to look through your PR (most of the time was actually 
>>> refreshing my memory on the change history leading up to your PR and 
>>> digesting the issue described). I think this makes a ton of sense. If I'm 
>>> understanding properly, the latest version of your PR essentially takes 
>>> advantage of Mayya's recent change 
>>> (https://github.com/apache/lucene/pull/13454) to the score supplier 
>>> behavior, which now does _some_ up-front work to iterate the first <= 16 
>>> terms when building the scoreSupplier, and computes a more 
>>> accurate/reasonable cost based on that already-done work. Am I getting this 
>>> right? If so, this seems like it has no downsides and all upside.
>>>
>>> I'll do a proper pass through the PR here shortly, but I love the idea 
>>> (assuming I'm understanding it properly on a Friday afternoon after a 
>>> long-ish week...).
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller <gsmil...@gmail.com> wrote:
>>>>
>>>> Hi Froh-
>>>>
>>>> Thanks for raising this and sorry I missed your tag in GH#13201 back in 
>>>> June (had some vacation and was generally away). I'd be interested to see 
>>>> what others think as well, but I'll at least commit to looking through 
>>>> your PR tomorrow or Monday to get a better handle on what's being 
>>>> proposed. We went through a few iterations of this originally before we 
>>>> landed on the current version. One promising approach was to have a more 
>>>> intelligent query that would load some number of terms up-front to get a 
>>>> better cost estimate before making a decision, but it required a custom 
>>>> query implementation that generally didn't get favorable feedback (it's 
>>>> nice to be able to use the existing IndexOrDocValuesQuery abstraction 
>>>> instead). I can dig up some of that conversation if it's helpful, but I'll 
>>>> better understand what you've got in mind first.
>>>>
>>>> Unwinding a bit, though: I also agree in general that we should be 
>>>> able to do a better job estimating cost here. I think the tricky part is 
>>>> how we go about doing that effectively. Thanks again for kicking off this 
>>>> thread!
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <msf...@gmail.com> wrote:
>>>>>
>>>>> Hi there,
>>>>>
>>>>> For a few months, some of us have been running into issues with the cost 
>>>>> estimate from AbstractMultiTermQueryConstantScoreWrapper. 
>>>>> (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300)
>>>>>
>>>>> In https://github.com/apache/lucene/issues/13029, the problem was raised 
>>>>> in terms of queries not being cached, because the estimated cost was too 
>>>>> high.
>>>>>
>>>>> We've also run into problems in OpenSearch, since we started wrapping 
>>>>> MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an exaggerated 
>>>>> cost estimate, so IndexOrDocValuesQuery decides it should be a DV query, 
>>>>> even though the MTQ would really only match a handful of docs (and should 
>>>>> be the lead iterator).
>>>>>
>>>>> I opened a PR back in March (https://github.com/apache/lucene/pull/13201) 
>>>>> to try to handle the case where a MultiTermQuery matches a small number 
>>>>> of terms. Since Mayya's change in 
>>>>> https://github.com/apache/lucene/pull/13454 moved the rewrite logic that 
>>>>> expands up to 16 terms (to rewrite as a Boolean disjunction) earlier in 
>>>>> the workflow, we get a better cost estimate for MTQs that match only a 
>>>>> few terms "for free".
>>>>>
>>>>> What do folks think?
>>>>>
>>>>> Thanks,
>>>>> Froh
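The up-front term-expansion idea discussed in this thread can be sketched as follows. This is a toy model, not the actual AbstractMultiTermQueryConstantScoreWrapper code; the method names, the `limit` parameter, and the fallback estimate are illustrative assumptions:

```java
import java.util.Iterator;

// Toy sketch of the cost-estimation idea: expand up to `limit` matching
// terms up front. If the enumeration finishes within the limit, the query
// matches few terms, so we can report a small, accurate cost (the sum of
// the expanded terms' doc frequencies). Otherwise we fall back to a coarse
// field-wide estimate, as before.
class CostEstimate {
    static long estimate(Iterator<Long> termDocFreqs, int limit, long fieldWideEstimate) {
        long cost = 0;
        int seen = 0;
        while (termDocFreqs.hasNext()) {
            if (++seen > limit) {
                return fieldWideEstimate; // too many terms: coarse estimate
            }
            cost += termDocFreqs.next(); // accumulate docFreq per expanded term
        }
        return cost; // enumeration exhausted within the limit: accurate cost
    }
}
```

A query matching two terms with doc frequencies 3 and 2 would report a cost of 5 instead of a field-wide upper bound, which is what lets an MTQ on a handful of terms look as cheap as it really is.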
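Why the exaggerated estimate matters for IndexOrDocValuesQuery can be sketched with this hypothetical decision rule. The names and the exact comparison are illustrative, not Lucene's actual implementation:

```java
// Hypothetical sketch of an IndexOrDocValuesQuery-style decision:
// `indexCost` is the term query's estimated cost (docs it might match);
// `leadCost` is roughly how many docs the rest of the query will iterate
// anyway. If the index iterator is no more expensive than the lead, let
// it drive iteration; otherwise use doc values to verify only the
// candidate docs produced by the other clauses.
class LeadChoice {
    static String choose(long indexCost, long leadCost) {
        return indexCost <= leadCost ? "index" : "docvalues";
    }
}
```

With an inflated `indexCost`, an MTQ that really matches a handful of docs gets pushed onto the doc-values path even though it would have been the cheapest lead iterator.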

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
