Re: Slow DV equivalent of TermInSetQuery

Joel Bernstein Tue, 26 Oct 2021 15:10:40 -0700

There are times, particularly in ecommerce and access control, where speed
really matters. So, you build stuff that's really fast at query time, with
a tradeoff at commit time.



Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Oct 26, 2021 at 5:31 PM Robert Muir <rcm...@gmail.com> wrote:

> Sorry, I don't think there is a need to use any top-level ordinals.
> None of these docvalues-based query implementations need them.
>
> As for a query intersecting an input stream, that is a big no-go.
> Lucene Queries need to have correct hashCode/equals/etc.
>
> That's why the current code around this, such as TermInSetQuery, encodes
> everything into a PrefixCodedTerms.
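
To illustrate the equals/hashCode point above, here is a minimal sketch in plain Java. It does not use Lucene's API: `TermSetKeySketch` is a hypothetical stand-in for a query object, and a sorted `byte[][]` stands in for the encoded terms. The idea is only that terms copied into an immutable, sorted form can be compared by value, which a query wrapping a one-shot InputStream cannot do without consuming the stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a query that owns its terms (not a Lucene class).
final class TermSetKeySketch {
    // Sorted, de-duplicated, encoded terms; never exposed, never mutated.
    private final byte[][] sortedTerms;

    TermSetKeySketch(String... terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy);
        this.sortedTerms = Arrays.stream(copy)
                .distinct()
                .map(t -> t.getBytes(StandardCharsets.UTF_8))
                .toArray(byte[][]::new);
    }

    @Override
    public boolean equals(Object other) {
        // Value equality: same set of terms, regardless of input order.
        return other instanceof TermSetKeySketch
                && Arrays.deepEquals(sortedTerms, ((TermSetKeySketch) other).sortedTerms);
    }

    @Override
    public int hashCode() {
        // Consistent with equals because it hashes the same canonical form.
        return Arrays.deepHashCode(sortedTerms);
    }
}
```

Two instances built from the same terms in a different order compare equal and hash identically, which is what anything that uses queries as map keys would rely on.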
>
> On Tue, Oct 26, 2021 at 4:57 PM Joel Bernstein <joels...@gmail.com> wrote:
> >
> > One more wrinkle for extremely large lists is to pass the list in as an
> > InputStream that is a presorted binary representation of the ASINs,
> > slide a BytesRef across the stream, and merge it with the SortedDocValues.
> > This saves all the object creation and String overhead for really long
> > lists of IDs.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein <joels...@gmail.com> wrote:
> >>
> >> If the list of ASINs is presorted, you can quickly merge it with the
> >> SortedDocValues and produce a FixedBitSet of the top-level ordinals, which
> >> can be used as the post filter. This is a nice approach for things like
> >> passing in a long list of access control predicates.
> >>
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >>
> >> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand <jpou...@gmail.com> wrote:
> >>>
> >>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about
> >>> these ideas.
> >>>
> >>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <rcm...@gmail.com> wrote:
> >>>>
> >>>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpou...@gmail.com> wrote:
> >>>> >
> >>>> > > And then we could make an IndexOrDocValuesQuery with both the
> >>>> > > TermInSetQuery and this SDV.newSlowInSetQuery?
> >>>> >
> >>>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the
> >>>> > "index" query can evaluate its cost (ScorerSupplier#cost) without doing
> >>>> > anything costly, which isn't the case for TermInSetQuery.
> >>>> >
> >>>> > So we'd need to make some changes. Estimating the cost of a
> >>>> > TermInSetQuery in general without seeking the terms is a hard problem,
> >>>> > but maybe we could specialize the unique key case to return the number
> >>>> > of terms as the cost?
> >>>>
> >>>> Yes we know each term in terms dict only has a single document, when
> >>>> terms.size() == terms.getSumDocFreq(): there's only one posting for
> >>>> each term.
> >>>> But we can probably generalize a cost estimation a bit more, just
> >>>> based on these two stats?
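
One way to read the cost idea above, as a rough sketch only: when the number of unique terms equals the sum of document frequencies, every term has exactly one posting, so the cost is the number of query terms. The generalization shown here (multiplying by the rounded-up average postings per term) is an assumption made for illustration; the thread does not specify a formula. `TermInSetCostSketch` and `estimateCost` are hypothetical names, and the inputs are plain longs rather than values read from a Lucene `Terms` instance.

```java
// Hypothetical cost estimate built from two field-level statistics.
final class TermInSetCostSketch {

    // queryTermCount: number of terms in the query's set
    // termsSize:      number of unique terms in the field
    // sumDocFreq:     total number of postings in the field
    // Returns -1 when the statistics are unavailable (assumed here to be
    // signalled by a non-positive value).
    static long estimateCost(long queryTermCount, long termsSize, long sumDocFreq) {
        if (termsSize <= 0 || sumDocFreq <= 0) {
            return -1; // no basis for an estimate
        }
        if (termsSize == sumDocFreq) {
            return queryTermCount; // unique-key case: one posting per term
        }
        // Assumed generalization: average postings per term, rounded up.
        long avgDocFreq = (sumDocFreq + termsSize - 1) / termsSize;
        return queryTermCount * avgDocFreq;
    }
}
```

An average is a crude proxy when document frequencies are skewed, so this should be read as an upper-level heuristic rather than a tight bound.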
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>>> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>>>
> >>>
> >>>
> >>> --
> >>> Adrien
>
