mikemccand opened a new issue, #15520:
URL: https://github.com/apache/lucene/issues/15520

   ### Description
   
   Primary key lookup is such a common use case for search engines -- Lucene's 
nightly benchmarks track the performance of `PKLookup` (primary key lookup), 
finding the one document that contains a unique term.  It's a stress test of 
Lucene's terms index/dictionary impl, and it has been [surprisingly volatile 
over the past ~15 years](https://benchmarks.mikemccandless.com/PKLookup.html).
   
   This task 
([`PKLookupTask`](https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/PKLookupTask.java))
 is somewhat bespoke: it first (on init -- not counted in the benchy stopwatch 
timer) [creates a batch of 4000 randomly picked ids (all valid/existing -- 
maybe we should mix in some non-existing 
ids?)](https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/PKLookupTask.java#L56-L62)
 in the index.   Then, when the concurrent benchy search threads execute this 
task, they iterate through each id, reusing per-segment state (`TermsEnum`) and 
use `seekExact` to see if the term exists in that segment.  (Hmm, the ids are 
not pre-sorted -- that would make benchy a bit faster maybe -- terms dict likes 
it if you go forward only in your seeks.  Also, the segments are visited in 
whatever order the `DirectoryReader` has them in ... I think if it sorted by 
descending `maxDoc` that also might make things faster -- best bang for your 
buck ish?  Anyway:).  As soon as it finds a 
 live document for the id, it early terminates (doesn't check for more docs 
from the `PostingsEnum`, nor any of the remaining segments).
   
   I wonder if Lucene should offer a utility API, to look up one PK?  It'd 
early terminate when it found just the one doc.  It might even be able to 
optimize for cases where most lookups are not in the index, if the users 
expresses that.  The class would be reusable (hang onto the `TermsEnum` for 
each leaf), and thread bound (well, only one thread can do the lookup at a 
time, at least).  Users today would normally just use a `TermQuery`, or maybe 
`TermInSetQuery` for a batch of PK lookups, but those are missing the 
optimizations to early terminate.  It could be a `PrimaryKeyInSetQuery` or so?  
Or maybe lower level `PrimaryKeyTermStates` for the single PK lookup case.
   
   I suspect retrieving docs by their unique id is a common use case for 
Lucene/OpenSearch/Solr/Elasticsearch users.  At Amazon's customer facing 
product search we have many queries / clauses / other things (doc version 
tracking for out-of-order updates during indexing) that seem to do this, at 
quite a bit aggregated cost, and we haven't optimized them well but thinking 
about it.  This is also related to my dev email about [visualizing bloom filter 
tradeoffs](https://lists.apache.org/thread/f5sbhcpt7s6cmt318wfc6dvnc756f27n), 
since in theory Lucene's `BloomFilteringPostingsFormat` should maybe help this 
use case.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to