One more wrinkle for extremely large lists: pass the list in as an InputStream containing a presorted binary representation of the ASINs, then slide a BytesRef across the stream and merge it with the SortedDocValues. This avoids the object creation and String overhead for really long lists of ids.
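
A minimal sketch of the merge step, in plain Java rather than against the Lucene API. The class `SortedIdMerge`, the method `mergeToOrdinals`, and the helper `bytes` are hypothetical names for illustration. The sorted `byte[][]` dictionary stands in for the SortedDocValues term dictionary (array index = ordinal), and `java.util.BitSet` stands in for FixedBitSet. It assumes both inputs are sorted in unsigned byte order.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

final class SortedIdMerge {

    /** Hypothetical helper: encode strings as UTF-8 byte arrays. */
    static byte[][] bytes(String... values) {
        byte[][] out = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i].getBytes(StandardCharsets.UTF_8);
        }
        return out;
    }

    /**
     * Merge a presorted id list against a sorted term dictionary.
     *
     * @param dict sorted, unique terms; the array index plays the role of the ordinal
     * @param ids  presorted ids to look up (duplicates are tolerated)
     * @return a bit set with one bit per ordinal whose term appears in ids
     */
    static BitSet mergeToOrdinals(byte[][] dict, byte[][] ids) {
        BitSet matched = new BitSet(dict.length);
        int ord = 0;
        int i = 0;
        // Classic two-pointer merge: each side is walked exactly once.
        while (ord < dict.length && i < ids.length) {
            int cmp = Arrays.compareUnsigned(dict[ord], ids[i]);
            if (cmp == 0) {
                matched.set(ord);
                ord++;
                i++;
            } else if (cmp < 0) {
                ord++; // dictionary term is smaller: no id matches it
            } else {
                i++;   // id is smaller: it is not in the dictionary
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        byte[][] dict = bytes("B001", "B002", "B005", "B009");
        byte[][] ids = bytes("B000", "B002", "B009", "B010");
        System.out.println(mergeToOrdinals(dict, ids));
    }
}
```

In the streaming variant, the `ids` array would be replaced by a cursor that reads the next id from the InputStream into a reused buffer, so no per-id objects are allocated. The wire format for that stream is left open here.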
Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein <joels...@gmail.com> wrote:

> If the list of ASIN's is presorted you can quickly merge it with the
> SortedDocValues and produce a FixedBitSet of the top level ordinals, which
> can be used as the post filter. This is a nice approach for things like
> passing in a long list of access control predicates.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand <jpou...@gmail.com> wrote:
>
>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these
>> ideas.
>>
>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpou...@gmail.com> wrote:
>>> >
>>> > > And then we could make an IndexOrDocValuesQuery with both the
>>> > > TermInSetQuery and this SDV.newSlowInSetQuery?
>>> >
>>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the
>>> > "index" query can evaluate its cost (ScorerSupplier#cost) without doing
>>> > anything costly, which isn't the case for TermInSetQuery.
>>> >
>>> > So we'd need to make some changes. Estimating the cost of a
>>> > TermInSetQuery in general without seeking the terms is a hard problem, but
>>> > maybe we could specialize the unique key case to return the number of terms
>>> > as the cost?
>>>
>>> Yes we know each term in terms dict only has a single document, when
>>> terms.size() == terms.getSumDocFreq(): there's only one posting for
>>> each term.
>>> But we can probably generalize a cost estimation a bit more, just
>>> based on these two stats?
>>
>> --
>> Adrien
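
For reference, the unique-key check from the quoted thread can be sketched in plain Java. `TermInSetCost` and its methods are hypothetical names, not Lucene API; the two `long` arguments stand for the values of `terms.size()` and `terms.getSumDocFreq()`. The generalized estimate (query terms times average postings per term) is only one possible reading of "generalize a bit more", and is an assumption rather than what the thread settled on.

```java
final class TermInSetCost {

    /** True when every term in the field has exactly one posting. */
    static boolean isUniqueKey(long termCount, long sumDocFreq) {
        return termCount > 0 && termCount == sumDocFreq;
    }

    /**
     * Estimate the cost of matching queryTerms terms without seeking them.
     * Returns -1 when the statistics are unavailable (for example, a
     * term count reported as -1).
     */
    static long estimateCost(long termCount, long sumDocFreq, int queryTerms) {
        if (termCount <= 0 || sumDocFreq <= 0) {
            return -1; // unknown: the caller must fall back to another strategy
        }
        // Average postings per term, rounded up. This is exactly 1 for a unique key,
        // so the estimate collapses to the number of query terms.
        long avgDocFreq = (sumDocFreq + termCount - 1) / termCount;
        return queryTerms * avgDocFreq;
    }

    public static void main(String[] args) {
        System.out.println(isUniqueKey(1000, 1000));
        System.out.println(estimateCost(1000, 1000, 50));
        System.out.println(estimateCost(1000, 2500, 50));
    }
}
```

Using the average overestimates for rare terms and underestimates for frequent ones, but both inputs are available without touching the postings.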