Hi Greg,

Thanks for looking into this. PR here with the new impl + JMH benchmark:
https://github.com/apache/lucene/pull/16051. The JMH numbers are in the PR
description.

Regards,
Govind Balaji S

On Fri, May 8, 2026 at 9:30 PM Greg Miller <[email protected]> wrote:

> Thanks for sharing this work Govind. This is interesting stuff!
>
> My brief 2 cents: I suspect it's a somewhat atypical pattern to have
> queries that are dominated by large cardinality term-in-set disjunctions.
> So my initial reaction to this is that I suspect the majority of use-cases
> would prefer the memory efficient query representation. But that's just one
> opinion.
>
> That said, it would be great to publish your benchmarking in a PR! Let's
> capture this testing and have a discussion. Even if the community decides
> to stay with the current approach, you might want to consider adding your
> version of this query in the sandbox module. That could provide a nice,
> community-supported option for a more CPU efficient version of this query
> that users could opt into.
>
> Cheers,
> -Greg
>
> On Thu, May 7, 2026 at 3:20 AM Govind Balaji via dev <
> [email protected]> wrote:
>
>> Hello Lucene devs,
>>
>> I'd like to surface a benchmark observation about TermInSetQuery's choice
>> of PrefixCodedTerms as the in-memory term storage, and am curious if the
>> community is aware of this RAM-CPU tradeoff at large number of terms, and
>> the reasoning behind.
>>
>> ## Background
>> Before LUCENE-6350 (May 2015), TermsQuery (the predecessor of
>> TermInSetQuery) stored its terms as a sorted byte[] with a parallel int[]
>> of offsets, plus a cached int hashCode.
>> LUCENE-6350 replaced this shape with PrefixCodedTerms. The rationale
>> recorded in the issue was "This will save ram and cleanup a lot of the
>> code."
>>
>> ## The workload where this shows up
>>   - We run TermInSetQuery with a tens of thousands terms per query.
>>   - The LRUQueryCache skips caching the query on most segments because
>> they are below its per-leaf size threshold. In our workloads, this is ~90%
>> of segments (cumulatively a small fraction of docs).
>>     On those segments the query pays its full per-segment cost of
>> unpacking and iterating the same prefix coded terms on every request.
>>   - Our workloads are CPU-bound than memory-bound.
>>
>> In this setup, the work that is independent of segment size becomes the
>> dominant CPU cost of the query on all those many small segments.
>> PrefixCodedTerms.TermIterator.next loading delta-encoded bytes show up as
>> hot in profiles.
>>
>> ## What we tried
>> A reimplementation of TermInSetQuery whose only change is the storage
>> shape of the term list.
>> Same FilteredTermsEnum ping-pong (accept / nextSeekTerm) - just driven by
>> random access into a BytesRef[] instead of a streaming TermIterator.
>> Conceptually this is closer to the pre-LUCENE-6350 storage layout. We
>> also avoid the packTerms() cost in the constructor with this.
>>
>> We additionally keep the term bytes in a single contiguous byte[]
>> (length-prefixed VInts, BytesRef[] entries are zero-copy views into it),
>> and have equals/hashCode operate solely on that byte[].
>> That is how the pre-6350 implementation got its fast equals as well.
>> BytesRef[] is still redundant, but did not want to mess around with the
>> interfacing types too much and seemed okay.
>>
>> ## Results
>> JMH microbenchmarks against Lucene's TermInSetQuery (Query construction +
>> IndexSearcher.count(query)), on an in-memory index with the query cache
>> disabled.
>> We try a few different N terms in the TermInSetQuery, with each term
>> being a random 16 char hex.
>> For the benchmark, the index contains 50 segments and we vary what terms
>> are indexed in each segment too as follows.
>> Each indexed term is also a 16 char hex, and we index 1 doc for each term
>> that we chose to have indexed. Reported as speedup = TermInSetQuery time /
>> our impl time; >1 means our impl is faster.
>>
>>   Index content                                                   | N =
>> 300 | N = 3k |  N = 30k
>>   each segment has exactly the same terms in the random query     |
>> 1.30x   | 1.31x  |  1.22x
>>   Only 2 of N query terms in each segment, most seeks miss        |
>> 2.11x   | 3.41x  |  3.74x
>>   50k random (independent of the query) terms in each segment     |
>> 1.05x   | 1.15x  |  1.32x
>>
>> In production (opensearch indexes with multiple shards per index), where
>> this TermInSetQuery is only a subquery of our typical query,
>> we saw the overall mean latency decrease by ~22% with this change when
>> the number of terms in TermInSetQuery > 30K. (Decreased by ~2% for queries
>> with number of terms between 1k to 5k).
>>
>> ## What I'd like to ask
>> Is this pattern of queries considered atypical and not worth the CPU at
>> the cost of memory? Is this also a guiding principle in other parts of
>> lucene?
>> Or is there a quantified typical values for CPU-memory tradeoff? Asking
>> out of curiousity whether we should be tweaking more stuff similarly,
>> internally, for our CPU-bound workloads.
>>
>> Happy to put a PR up with the JMH benchmark/the impl we tried if that's
>> of interest.
>>
>> Thanks,
>> Govind Balaji S
>>
>>

Reply via email to