Hello Lucene devs,

I'd like to surface a benchmark observation about TermInSetQuery's choice
of PrefixCodedTerms as the in-memory term storage, and am curious if the
community is aware of this RAM-CPU tradeoff at large number of terms, and
the reasoning behind.

## Background
Before LUCENE-6350 (May 2015), TermsQuery (the predecessor of
TermInSetQuery) stored its terms as a sorted byte[] with a parallel int[]
of offsets, plus a cached int hashCode.
LUCENE-6350 replaced this shape with PrefixCodedTerms. The rationale
recorded in the issue was "This will save ram and cleanup a lot of the
code."

## The workload where this shows up
  - We run TermInSetQuery with a tens of thousands terms per query.
  - The LRUQueryCache skips caching the query on most segments because they
are below its per-leaf size threshold. In our workloads, this is ~90% of
segments (cumulatively a small fraction of docs).
    On those segments the query pays its full per-segment cost of unpacking
and iterating the same prefix coded terms on every request.
  - Our workloads are CPU-bound than memory-bound.

In this setup, the work that is independent of segment size becomes the
dominant CPU cost of the query on all those many small segments.
PrefixCodedTerms.TermIterator.next loading delta-encoded bytes show up as
hot in profiles.

## What we tried
A reimplementation of TermInSetQuery whose only change is the storage shape
of the term list.
Same FilteredTermsEnum ping-pong (accept / nextSeekTerm) - just driven by
random access into a BytesRef[] instead of a streaming TermIterator.
Conceptually this is closer to the pre-LUCENE-6350 storage layout. We also
avoid the packTerms() cost in the constructor with this.

We additionally keep the term bytes in a single contiguous byte[]
(length-prefixed VInts, BytesRef[] entries are zero-copy views into it),
and have equals/hashCode operate solely on that byte[].
That is how the pre-6350 implementation got its fast equals as well.
BytesRef[] is still redundant, but did not want to mess around with the
interfacing types too much and seemed okay.

## Results
JMH microbenchmarks against Lucene's TermInSetQuery (Query construction +
IndexSearcher.count(query)), on an in-memory index with the query cache
disabled.
We try a few different N terms in the TermInSetQuery, with each term being
a random 16 char hex.
For the benchmark, the index contains 50 segments and we vary what terms
are indexed in each segment too as follows.
Each indexed term is also a 16 char hex, and we index 1 doc for each term
that we chose to have indexed. Reported as speedup = TermInSetQuery time /
our impl time; >1 means our impl is faster.

  Index content                                                   | N = 300
| N = 3k |  N = 30k
  each segment has exactly the same terms in the random query     | 1.30x
 | 1.31x  |  1.22x
  Only 2 of N query terms in each segment, most seeks miss        | 2.11x
 | 3.41x  |  3.74x
  50k random (independent of the query) terms in each segment     | 1.05x
 | 1.15x  |  1.32x

In production (opensearch indexes with multiple shards per index), where
this TermInSetQuery is only a subquery of our typical query,
we saw the overall mean latency decrease by ~22% with this change when the
number of terms in TermInSetQuery > 30K. (Decreased by ~2% for queries with
number of terms between 1k to 5k).

## What I'd like to ask
Is this pattern of queries considered atypical and not worth the CPU at the
cost of memory? Is this also a guiding principle in other parts of lucene?
Or is there a quantified typical values for CPU-memory tradeoff? Asking out
of curiousity whether we should be tweaking more stuff similarly,
internally, for our CPU-bound workloads.

Happy to put a PR up with the JMH benchmark/the impl we tried if that's of
interest.

Thanks,
Govind Balaji S

Reply via email to