Thanks for sharing this work Govind. This is interesting stuff! My brief 2 cents: I suspect it's a somewhat atypical pattern to have queries that are dominated by large cardinality term-in-set disjunctions. So my initial reaction to this is that I suspect the majority of use-cases would prefer the memory efficient query representation. But that's just one opinion.
That said, it would be great to publish your benchmarking in a PR! Let's capture this testing and have a discussion. Even if the community decides to stay with the current approach, you might want to consider adding your version of this query in the sandbox module. That could provide a nice, community-supported option for a more CPU efficient version of this query that users could opt into. Cheers, -Greg On Thu, May 7, 2026 at 3:20 AM Govind Balaji via dev <[email protected]> wrote: > Hello Lucene devs, > > I'd like to surface a benchmark observation about TermInSetQuery's choice > of PrefixCodedTerms as the in-memory term storage, and am curious if the > community is aware of this RAM-CPU tradeoff at large number of terms, and > the reasoning behind. > > ## Background > Before LUCENE-6350 (May 2015), TermsQuery (the predecessor of > TermInSetQuery) stored its terms as a sorted byte[] with a parallel int[] > of offsets, plus a cached int hashCode. > LUCENE-6350 replaced this shape with PrefixCodedTerms. The rationale > recorded in the issue was "This will save ram and cleanup a lot of the > code." > > ## The workload where this shows up > - We run TermInSetQuery with a tens of thousands terms per query. > - The LRUQueryCache skips caching the query on most segments because > they are below its per-leaf size threshold. In our workloads, this is ~90% > of segments (cumulatively a small fraction of docs). > On those segments the query pays its full per-segment cost of > unpacking and iterating the same prefix coded terms on every request. > - Our workloads are CPU-bound than memory-bound. > > In this setup, the work that is independent of segment size becomes the > dominant CPU cost of the query on all those many small segments. > PrefixCodedTerms.TermIterator.next loading delta-encoded bytes show up as > hot in profiles. > > ## What we tried > A reimplementation of TermInSetQuery whose only change is the storage > shape of the term list. > Same FilteredTermsEnum ping-pong (accept / nextSeekTerm) - just driven by > random access into a BytesRef[] instead of a streaming TermIterator. > Conceptually this is closer to the pre-LUCENE-6350 storage layout. We also > avoid the packTerms() cost in the constructor with this. > > We additionally keep the term bytes in a single contiguous byte[] > (length-prefixed VInts, BytesRef[] entries are zero-copy views into it), > and have equals/hashCode operate solely on that byte[]. > That is how the pre-6350 implementation got its fast equals as well. > BytesRef[] is still redundant, but did not want to mess around with the > interfacing types too much and seemed okay. > > ## Results > JMH microbenchmarks against Lucene's TermInSetQuery (Query construction + > IndexSearcher.count(query)), on an in-memory index with the query cache > disabled. > We try a few different N terms in the TermInSetQuery, with each term being > a random 16 char hex. > For the benchmark, the index contains 50 segments and we vary what terms > are indexed in each segment too as follows. > Each indexed term is also a 16 char hex, and we index 1 doc for each term > that we chose to have indexed. Reported as speedup = TermInSetQuery time / > our impl time; >1 means our impl is faster. > > Index content | N = > 300 | N = 3k | N = 30k > each segment has exactly the same terms in the random query | 1.30x > | 1.31x | 1.22x > Only 2 of N query terms in each segment, most seeks miss | 2.11x > | 3.41x | 3.74x > 50k random (independent of the query) terms in each segment | 1.05x > | 1.15x | 1.32x > > In production (opensearch indexes with multiple shards per index), where > this TermInSetQuery is only a subquery of our typical query, > we saw the overall mean latency decrease by ~22% with this change when the > number of terms in TermInSetQuery > 30K. (Decreased by ~2% for queries with > number of terms between 1k to 5k). > > ## What I'd like to ask > Is this pattern of queries considered atypical and not worth the CPU at > the cost of memory? Is this also a guiding principle in other parts of > lucene? > Or is there a quantified typical values for CPU-memory tradeoff? Asking > out of curiousity whether we should be tweaking more stuff similarly, > internally, for our CPU-bound workloads. > > Happy to put a PR up with the JMH benchmark/the impl we tried if that's of > interest. > > Thanks, > Govind Balaji S > >
