Re: Slow DV equivalent of TermInSetQuery

Robert Muir Tue, 26 Oct 2021 14:31:02 -0700

Sorry, I don't think there is a need to use any top-level ordinals.
None of these docvalues-based query implementations need them.
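
To make the per-segment point concrete, here is a minimal sketch in plain
Java, not actual Lucene code. The sorted String array stands in for one
segment's SortedDocValues term dictionary, the int array stands in for the
per-document ordinals, and the class and method names are hypothetical. Each
segment resolves the query terms against its own dictionary, so only
segment-local ordinals are ever needed.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for per-segment doc-values matching; not Lucene code.
final class SegmentLocalMatch {

    /** Resolve query terms to this segment's local ordinals (index in its sorted dictionary). */
    static BitSet matchingOrds(String[] segmentDict, List<String> queryTerms) {
        BitSet ords = new BitSet(segmentDict.length);
        for (String term : queryTerms) {
            int ord = Arrays.binarySearch(segmentDict, term);
            if (ord >= 0) {          // negative means the term is absent from this segment
                ords.set(ord);
            }
        }
        return ords;
    }

    /** Collect the docs whose segment-local ordinal is in the set; -1 means "no value". */
    static BitSet matchingDocs(int[] docToOrd, BitSet ords) {
        BitSet docs = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            int ord = docToOrd[doc];
            if (ord >= 0 && ords.get(ord)) {
                docs.set(doc);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        String[] dict = {"a1", "b2", "d4"};      // one segment's sorted terms
        int[] docToOrd = {2, -1, 0, 1};          // doc id -> local ordinal
        BitSet ords = matchingOrds(dict, Arrays.asList("b2", "c3", "d4"));
        System.out.println(matchingDocs(docToOrd, ords));
    }
}
```

The same two steps would be repeated independently for every segment, which
is why a global ordinal map is not required in this sketch.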


As for a query intersecting an input stream, that is a big no-go:
Lucene Queries need to have correct hashCode/equals/etc.

That's why the current code in this area, such as TermInSetQuery, encodes
everything into a PrefixCodedTerms.
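
As an illustration of the equals/hashCode point, here is a minimal sketch in
plain Java with hypothetical names. It is not the actual TermInSetQuery or
PrefixCodedTerms code; it only mimics the idea of sorting, de-duplicating,
and eagerly encoding the terms into one immutable byte[], so that two
instances built from the same terms compare equal. A one-shot InputStream
offers no such stable state to compare or hash.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical value class; a simplified stand-in, not Lucene's encoding.
final class EncodedTermSet {
    private final String field;
    private final byte[] encoded;   // sorted, de-duplicated, length-prefixed terms
    private final int hash;         // computed once, since the state is immutable

    EncodedTermSet(String field, Collection<String> terms) {
        this.field = field;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // TreeSet gives a canonical order and drops duplicates.
        for (String term : new TreeSet<>(terms)) {
            byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
            // 4-byte big-endian length prefix keeps the encoding unambiguous.
            out.write(bytes.length >>> 24);
            out.write(bytes.length >>> 16);
            out.write(bytes.length >>> 8);
            out.write(bytes.length);
            out.write(bytes, 0, bytes.length);
        }
        this.encoded = out.toByteArray();
        this.hash = 31 * field.hashCode() + Arrays.hashCode(encoded);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EncodedTermSet)) return false;
        EncodedTermSet other = (EncodedTermSet) o;
        return hash == other.hash
            && field.equals(other.field)
            && Arrays.equals(encoded, other.encoded);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```

Because the input order and duplicates are normalized away in the
constructor, the object can safely serve as a key in a hash-based cache.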

On Tue, Oct 26, 2021 at 4:57 PM Joel Bernstein <joels...@gmail.com> wrote:
>
> One more wrinkle for extremely large lists is to pass the list in as an
> InputStream, which is a presorted binary representation of the ASINs, and
> slide a BytesRef across the stream and merge it with the SortedDocValues.
> This saves on all the object creation and String overhead for really long
> lists of ids.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein <joels...@gmail.com> wrote:
>>
>> If the list of ASINs is presorted, you can quickly merge it with the
>> SortedDocValues and produce a FixedBitSet of the top-level ordinals, which
>> can be used as the post filter. This is a nice approach for things like
>> passing in a long list of access control predicates.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>>
>> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand <jpou...@gmail.com> wrote:
>>>
>>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these 
>>> ideas.
>>>
>>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <rcm...@gmail.com> wrote:
>>>>
>>>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpou...@gmail.com> wrote:
>>>> >
>>>> > > And then we could make an IndexOrDocValuesQuery with both the 
>>>> > > TermInSetQuery and this SDV.newSlowInSetQuery?
>>>> >
>>>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the "index" 
>>>> > query can evaluate its cost (ScorerSupplier#cost) without doing anything 
>>>> > costly, which isn't the case for TermInSetQuery.
>>>> >
>>>> > So we'd need to make some changes. Estimating the cost of a 
>>>> > TermInSetQuery in general without seeking the terms is a hard problem, 
>>>> > but maybe we could specialize the unique key case to return the number 
>>>> > of terms as the cost?
>>>>
>>>> Yes, we know each term in the terms dict has only a single document when
>>>> terms.size() == terms.getSumDocFreq(): there's only one posting for
>>>> each term.
>>>> But we can probably generalize the cost estimation a bit more, just
>>>> based on these two stats?
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>
>>>
>>>
>>> --
>>> Adrien
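
For reference, the cost idea from the quoted exchange can be sketched as
follows. This is plain Java with hypothetical names, taking the two stats
(number of terms and sum of doc freqs) as plain long values. The unique-key
branch follows the thread; the generalization (query terms multiplied by the
average number of postings per term) is only one assumed way to use the two
stats, not something the thread settles on.

```java
// Hypothetical helper; the averaging heuristic below is an assumption.
final class TermSetCostEstimate {

    /** True when every term has exactly one posting, i.e. a unique-key field. */
    static boolean isUniqueKey(long termCount, long sumDocFreq) {
        return termCount > 0 && termCount == sumDocFreq;
    }

    /** Estimate how many documents a set of queryTermCount terms may match. */
    static long estimateCost(long queryTermCount, long termCount, long sumDocFreq) {
        if (termCount <= 0) {
            // Term count unavailable: fall back to the total number of postings.
            return sumDocFreq;
        }
        if (isUniqueKey(termCount, sumDocFreq)) {
            // One document per term, so the number of query terms bounds the cost.
            return Math.min(queryTermCount, termCount);
        }
        // Assumed generalization: average postings per term times the query terms,
        // capped at the total number of postings in the field.
        double avgDocsPerTerm = (double) sumDocFreq / termCount;
        return Math.min(sumDocFreq, (long) Math.ceil(queryTermCount * avgDocsPerTerm));
    }
}
```

Neither branch needs to seek any term, so the estimate stays cheap enough to
compute before deciding how to execute the query.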

