Hi Greg, Thanks for looking into this. PR here with the new impl + JMH benchmark: https://github.com/apache/lucene/pull/16051. The JMH numbers are in the PR description.
Regards, Govind Balaji S On Fri, May 8, 2026 at 9:30 PM Greg Miller <[email protected]> wrote: > Thanks for sharing this work Govind. This is interesting stuff! > > My brief 2 cents: I suspect it's a somewhat atypical pattern to have > queries that are dominated by large cardinality term-in-set disjunctions. > So my initial reaction to this is that I suspect the majority of use-cases > would prefer the memory efficient query representation. But that's just one > opinion. > > That said, it would be great to publish your benchmarking in a PR! Let's > capture this testing and have a discussion. Even if the community decides > to stay with the current approach, you might want to consider adding your > version of this query in the sandbox module. That could provide a nice, > community-supported option for a more CPU efficient version of this query > that users could opt into. > > Cheers, > -Greg > > On Thu, May 7, 2026 at 3:20 AM Govind Balaji via dev < > [email protected]> wrote: > >> Hello Lucene devs, >> >> I'd like to surface a benchmark observation about TermInSetQuery's choice >> of PrefixCodedTerms as the in-memory term storage, and am curious if the >> community is aware of this RAM-CPU tradeoff at large number of terms, and >> the reasoning behind. >> >> ## Background >> Before LUCENE-6350 (May 2015), TermsQuery (the predecessor of >> TermInSetQuery) stored its terms as a sorted byte[] with a parallel int[] >> of offsets, plus a cached int hashCode. >> LUCENE-6350 replaced this shape with PrefixCodedTerms. The rationale >> recorded in the issue was "This will save ram and cleanup a lot of the >> code." >> >> ## The workload where this shows up >> - We run TermInSetQuery with a tens of thousands terms per query. >> - The LRUQueryCache skips caching the query on most segments because >> they are below its per-leaf size threshold. In our workloads, this is ~90% >> of segments (cumulatively a small fraction of docs). >> On those segments the query pays its full per-segment cost of >> unpacking and iterating the same prefix coded terms on every request. >> - Our workloads are CPU-bound than memory-bound. >> >> In this setup, the work that is independent of segment size becomes the >> dominant CPU cost of the query on all those many small segments. >> PrefixCodedTerms.TermIterator.next loading delta-encoded bytes show up as >> hot in profiles. >> >> ## What we tried >> A reimplementation of TermInSetQuery whose only change is the storage >> shape of the term list. >> Same FilteredTermsEnum ping-pong (accept / nextSeekTerm) - just driven by >> random access into a BytesRef[] instead of a streaming TermIterator. >> Conceptually this is closer to the pre-LUCENE-6350 storage layout. We >> also avoid the packTerms() cost in the constructor with this. >> >> We additionally keep the term bytes in a single contiguous byte[] >> (length-prefixed VInts, BytesRef[] entries are zero-copy views into it), >> and have equals/hashCode operate solely on that byte[]. >> That is how the pre-6350 implementation got its fast equals as well. >> BytesRef[] is still redundant, but did not want to mess around with the >> interfacing types too much and seemed okay. >> >> ## Results >> JMH microbenchmarks against Lucene's TermInSetQuery (Query construction + >> IndexSearcher.count(query)), on an in-memory index with the query cache >> disabled. >> We try a few different N terms in the TermInSetQuery, with each term >> being a random 16 char hex. >> For the benchmark, the index contains 50 segments and we vary what terms >> are indexed in each segment too as follows. >> Each indexed term is also a 16 char hex, and we index 1 doc for each term >> that we chose to have indexed. Reported as speedup = TermInSetQuery time / >> our impl time; >1 means our impl is faster. >> >> Index content | N = >> 300 | N = 3k | N = 30k >> each segment has exactly the same terms in the random query | >> 1.30x | 1.31x | 1.22x >> Only 2 of N query terms in each segment, most seeks miss | >> 2.11x | 3.41x | 3.74x >> 50k random (independent of the query) terms in each segment | >> 1.05x | 1.15x | 1.32x >> >> In production (opensearch indexes with multiple shards per index), where >> this TermInSetQuery is only a subquery of our typical query, >> we saw the overall mean latency decrease by ~22% with this change when >> the number of terms in TermInSetQuery > 30K. (Decreased by ~2% for queries >> with number of terms between 1k to 5k). >> >> ## What I'd like to ask >> Is this pattern of queries considered atypical and not worth the CPU at >> the cost of memory? Is this also a guiding principle in other parts of >> lucene? >> Or is there a quantified typical values for CPU-memory tradeoff? Asking >> out of curiousity whether we should be tweaking more stuff similarly, >> internally, for our CPU-bound workloads. >> >> Happy to put a PR up with the JMH benchmark/the impl we tried if that's >> of interest. >> >> Thanks, >> Govind Balaji S >> >>
