Thanks for sharing this work Govind. This is interesting stuff!

My brief 2 cents: I suspect it's a somewhat atypical pattern to have
queries that are dominated by large cardinality term-in-set disjunctions.
So my initial reaction to this is that I suspect the majority of use-cases
would prefer the memory efficient query representation. But that's just one
opinion.

That said, it would be great to publish your benchmarking in a PR! Let's
capture this testing and have a discussion. Even if the community decides
to stay with the current approach, you might want to consider adding your
version of this query in the sandbox module. That could provide a nice,
community-supported option for a more CPU efficient version of this query
that users could opt into.

Cheers,
-Greg

On Thu, May 7, 2026 at 3:20 AM Govind Balaji via dev <[email protected]>
wrote:

> Hello Lucene devs,
>
> I'd like to surface a benchmark observation about TermInSetQuery's choice
> of PrefixCodedTerms as the in-memory term storage, and am curious if the
> community is aware of this RAM-CPU tradeoff at large number of terms, and
> the reasoning behind.
>
> ## Background
> Before LUCENE-6350 (May 2015), TermsQuery (the predecessor of
> TermInSetQuery) stored its terms as a sorted byte[] with a parallel int[]
> of offsets, plus a cached int hashCode.
> LUCENE-6350 replaced this shape with PrefixCodedTerms. The rationale
> recorded in the issue was "This will save ram and cleanup a lot of the
> code."
>
> ## The workload where this shows up
>   - We run TermInSetQuery with a tens of thousands terms per query.
>   - The LRUQueryCache skips caching the query on most segments because
> they are below its per-leaf size threshold. In our workloads, this is ~90%
> of segments (cumulatively a small fraction of docs).
>     On those segments the query pays its full per-segment cost of
> unpacking and iterating the same prefix coded terms on every request.
>   - Our workloads are CPU-bound than memory-bound.
>
> In this setup, the work that is independent of segment size becomes the
> dominant CPU cost of the query on all those many small segments.
> PrefixCodedTerms.TermIterator.next loading delta-encoded bytes show up as
> hot in profiles.
>
> ## What we tried
> A reimplementation of TermInSetQuery whose only change is the storage
> shape of the term list.
> Same FilteredTermsEnum ping-pong (accept / nextSeekTerm) - just driven by
> random access into a BytesRef[] instead of a streaming TermIterator.
> Conceptually this is closer to the pre-LUCENE-6350 storage layout. We also
> avoid the packTerms() cost in the constructor with this.
>
> We additionally keep the term bytes in a single contiguous byte[]
> (length-prefixed VInts, BytesRef[] entries are zero-copy views into it),
> and have equals/hashCode operate solely on that byte[].
> That is how the pre-6350 implementation got its fast equals as well.
> BytesRef[] is still redundant, but did not want to mess around with the
> interfacing types too much and seemed okay.
>
> ## Results
> JMH microbenchmarks against Lucene's TermInSetQuery (Query construction +
> IndexSearcher.count(query)), on an in-memory index with the query cache
> disabled.
> We try a few different N terms in the TermInSetQuery, with each term being
> a random 16 char hex.
> For the benchmark, the index contains 50 segments and we vary what terms
> are indexed in each segment too as follows.
> Each indexed term is also a 16 char hex, and we index 1 doc for each term
> that we chose to have indexed. Reported as speedup = TermInSetQuery time /
> our impl time; >1 means our impl is faster.
>
>   Index content                                                   | N =
> 300 | N = 3k |  N = 30k
>   each segment has exactly the same terms in the random query     | 1.30x
>  | 1.31x  |  1.22x
>   Only 2 of N query terms in each segment, most seeks miss        | 2.11x
>  | 3.41x  |  3.74x
>   50k random (independent of the query) terms in each segment     | 1.05x
>  | 1.15x  |  1.32x
>
> In production (opensearch indexes with multiple shards per index), where
> this TermInSetQuery is only a subquery of our typical query,
> we saw the overall mean latency decrease by ~22% with this change when the
> number of terms in TermInSetQuery > 30K. (Decreased by ~2% for queries with
> number of terms between 1k to 5k).
>
> ## What I'd like to ask
> Is this pattern of queries considered atypical and not worth the CPU at
> the cost of memory? Is this also a guiding principle in other parts of
> lucene?
> Or is there a quantified typical values for CPU-memory tradeoff? Asking
> out of curiousity whether we should be tweaking more stuff similarly,
> internally, for our CPU-bound workloads.
>
> Happy to put a PR up with the JMH benchmark/the impl we tried if that's of
> interest.
>
> Thanks,
> Govind Balaji S
>
>

Reply via email to