[I] Should Lucene offer a helper class for primary key lookups? [lucene]

via GitHub Tue, 23 Dec 2025 14:38:31 -0800


mikemccand opened a new issue, #15520:
URL: https://github.com/apache/lucene/issues/15520

### Description

Primary key lookup is such a common use case for search engines -- Lucene's
nightly benchmarks track the performance of `PKLookup` (primary key lookup),
finding the one document that contains a unique term. It's a stress test of
Lucene's terms index/dictionary impl, and it has been [surprisingly volatile
over the past ~15 years](https://benchmarks.mikemccandless.com/PKLookup.html).

This task
([`PKLookupTask`](https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/PKLookupTask.java))
is somewhat bespoke: it first (on init -- not counted in the benchy stopwatch
timer) [creates a batch of 4000 randomly picked ids (all valid/existing --
maybe we should mix in some non-existing
ids?)](https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/PKLookupTask.java#L56-L62)
in the index. Then, when the concurrent benchy search threads execute this
task, they iterate through each id, reusing per-segment state (`TermsEnum`) and
use `seekExact` to see if the term exists in that segment. (Hmm, the ids are
not pre-sorted -- that would make benchy a bit faster maybe -- terms dict likes
it if you go forward only in your seeks. Also, the segments are visited in
whatever order the `DirectoryReader` has them in ... I think if it sorted by
descending `maxDoc` that also might make things faster -- best bang for your
buck ish? Anyway:). As soon as it finds a
live document for the id, it early terminates (doesn't check for more docs
from the `PostingsEnum`, nor any of the remaining segments).

I wonder if Lucene should offer a utility API, to look up one PK? It'd
early terminate when it found just the one doc. It might even be able to
optimize for cases where most lookups are not in the index, if the users
expresses that. The class would be reusable (hang onto the `TermsEnum` for
each leaf), and thread bound (well, only one thread can do the lookup at a
time, at least). Users today would normally just use a `TermQuery`, or maybe
`TermInSetQuery` for a batch of PK lookups, but those are missing the
optimizations to early terminate. It could be a `PrimaryKeyInSetQuery` or so?
Or maybe lower level `PrimaryKeyTermStates` for the single PK lookup case.

I suspect retrieving docs by their unique id is a common use case for
Lucene/OpenSearch/Solr/Elasticsearch users. At Amazon's customer facing
product search we have many queries / clauses / other things (doc version
tracking for out-of-order updates during indexing) that seem to do this, at
quite a bit aggregated cost, and we haven't optimized them well but thinking
about it. This is also related to my dev email about [visualizing bloom filter
tradeoffs](https://lists.apache.org/thread/f5sbhcpt7s6cmt318wfc6dvnc756f27n),
since in theory Lucene's `BloomFilteringPostingsFormat` should maybe help this
use case.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Should Lucene offer a helper class for primary key lookups? [lucene]

Reply via email to