Hello Lucene devs,
I'd like to surface a benchmark observation about TermInSetQuery's choice
of PrefixCodedTerms as the in-memory term storage, and am curious if the
community is aware of this RAM-CPU tradeoff at large number of terms, and
the reasoning behind.
## Background
Before LUCENE-6350 (May 2015), TermsQuery (the predecessor of
TermInSetQuery) stored its terms as a sorted byte[] with a parallel int[]
of offsets, plus a cached int hashCode.
LUCENE-6350 replaced this shape with PrefixCodedTerms. The rationale
recorded in the issue was "This will save ram and cleanup a lot of the
code."
## The workload where this shows up
- We run TermInSetQuery with a tens of thousands terms per query.
- The LRUQueryCache skips caching the query on most segments because they
are below its per-leaf size threshold. In our workloads, this is ~90% of
segments (cumulatively a small fraction of docs).
On those segments the query pays its full per-segment cost of unpacking
and iterating the same prefix coded terms on every request.
- Our workloads are CPU-bound than memory-bound.
In this setup, the work that is independent of segment size becomes the
dominant CPU cost of the query on all those many small segments.
PrefixCodedTerms.TermIterator.next loading delta-encoded bytes show up as
hot in profiles.
## What we tried
A reimplementation of TermInSetQuery whose only change is the storage shape
of the term list.
Same FilteredTermsEnum ping-pong (accept / nextSeekTerm) - just driven by
random access into a BytesRef[] instead of a streaming TermIterator.
Conceptually this is closer to the pre-LUCENE-6350 storage layout. We also
avoid the packTerms() cost in the constructor with this.
We additionally keep the term bytes in a single contiguous byte[]
(length-prefixed VInts, BytesRef[] entries are zero-copy views into it),
and have equals/hashCode operate solely on that byte[].
That is how the pre-6350 implementation got its fast equals as well.
BytesRef[] is still redundant, but did not want to mess around with the
interfacing types too much and seemed okay.
## Results
JMH microbenchmarks against Lucene's TermInSetQuery (Query construction +
IndexSearcher.count(query)), on an in-memory index with the query cache
disabled.
We try a few different N terms in the TermInSetQuery, with each term being
a random 16 char hex.
For the benchmark, the index contains 50 segments and we vary what terms
are indexed in each segment too as follows.
Each indexed term is also a 16 char hex, and we index 1 doc for each term
that we chose to have indexed. Reported as speedup = TermInSetQuery time /
our impl time; >1 means our impl is faster.
Index content | N = 300
| N = 3k | N = 30k
each segment has exactly the same terms in the random query | 1.30x
| 1.31x | 1.22x
Only 2 of N query terms in each segment, most seeks miss | 2.11x
| 3.41x | 3.74x
50k random (independent of the query) terms in each segment | 1.05x
| 1.15x | 1.32x
In production (opensearch indexes with multiple shards per index), where
this TermInSetQuery is only a subquery of our typical query,
we saw the overall mean latency decrease by ~22% with this change when the
number of terms in TermInSetQuery > 30K. (Decreased by ~2% for queries with
number of terms between 1k to 5k).
## What I'd like to ask
Is this pattern of queries considered atypical and not worth the CPU at the
cost of memory? Is this also a guiding principle in other parts of lucene?
Or is there a quantified typical values for CPU-memory tradeoff? Asking out
of curiousity whether we should be tweaking more stuff similarly,
internally, for our CPU-bound workloads.
Happy to put a PR up with the JMH benchmark/the impl we tried if that's of
interest.
Thanks,
Govind Balaji S