Hi folks-

Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote
TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery
can now leverage the various "rewrite methods" available to MultiTermQuery,
allowing users to customize the query evaluation strategy (e.g., postings
vs. doc values, etc.), which was a nice win. In the benchmarks we ran, we
didn't see any performance issues.

In anticipation of the 9.6 release, I've pulled this change into the Lucene
snapshot we use for Amazon product search and started running some
additional benchmarks, which have surfaced an interesting issue. One
use-case we have for TermInSetQuery creates a term disjunction over a field
that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
bloom filtering can only help with seekExact and not seekCeil, and the
MultiTermQuery-based term intersection relies on seekCeil, we're seeing a
performance regression (primarily in red-line QPS).

One way I can think to address this is to move back to a seekExact approach
when creating the filtered TermsEnum used by MultiTermQuery (for the
TermInSetQuery implementation). Because TermInSetQuery can provide all of
its terms up-front, we can have a simpler term intersection implementation
that relies on seekExact over seekCeil. Here's a quick take on what I'm
thinking:
https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5
I've tested this internally and confirmed it solves our QPS regression
problem.
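
To make the bloom filter interaction concrete, here's a rough toy model. It
is not the code in the commit linked above and it uses no Lucene classes:
ToyTermDictionary, its seek counter, and the HashSet standing in for the
bloom filter are all hypothetical names made up for illustration. The only
point is that an exact lookup can be rejected before the term dictionary is
touched, while a ceiling lookup cannot.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: none of these types are Lucene classes. */
final class SeekExactSketch {

  /** Hypothetical stand-in for a term dictionary fronted by a bloom filter. */
  static final class ToyTermDictionary {
    private final TreeSet<String> sortedTerms = new TreeSet<>();
    // A HashSet stands in for the bloom filter; a real one can report false positives.
    private final Set<String> filter = new HashSet<>();
    // Counts how often the (expensive) sorted dictionary is actually consulted.
    int dictionarySeeks = 0;

    ToyTermDictionary(List<String> terms) {
      sortedTerms.addAll(terms);
      filter.addAll(terms);
    }

    /** Exact lookup: the filter can reject an absent term before the dictionary is touched. */
    boolean seekExact(String term) {
      if (!filter.contains(term)) {
        return false;
      }
      dictionarySeeks++;
      return sortedTerms.contains(term);
    }

    /** Ceiling lookup: the filter cannot help, since the answer may be a different term. */
    String seekCeil(String term) {
      dictionarySeeks++;
      return sortedTerms.ceiling(term);
    }
  }

  /** Intersects the query terms with the dictionary using only exact seeks. */
  static List<String> intersectWithSeekExact(ToyTermDictionary dict, List<String> queryTerms) {
    List<String> matches = new ArrayList<>();
    for (String term : queryTerms) {
      if (dict.seekExact(term)) {
        matches.add(term);
      }
    }
    return matches;
  }

  /** Same result, but with a naive ceiling probe per query term (for comparison only). */
  static List<String> intersectWithSeekCeil(ToyTermDictionary dict, List<String> queryTerms) {
    List<String> matches = new ArrayList<>();
    for (String term : queryTerms) {
      if (term.equals(dict.seekCeil(term))) {
        matches.add(term);
      }
    }
    return matches;
  }
}
```

In this model, a query whose terms are mostly absent from the field costs
almost no dictionary seeks on the seekExact path, which is the effect the
bloom filter is there to provide.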

I'm curious if anyone has an objection to moving back to a seekExact term
intersection approach for TermInSetQuery, or has alternative ideas. Am I
overlooking some important factors by focusing too much on this specific
case, where the bloom filter interaction is hurting performance? It seems
like seekCeil could provide benefits over seekExact in some cases by
skipping over multiple query terms at a time, so that's a possible
consideration. If we solve for the most common cases by default, I suppose
advanced users could always override TermInSetQuery#getTermsEnum as
necessary (for example, we could take this approach internally to work with
our bloom filtering if the best default is to leverage seekCeil).

I can easily turn my quick solution into a PR, but before I do, I wanted to
poll this group for thoughts on the approach or other alternatives I might
be overlooking. Thanks in advance!
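
For the other side of the trade-off mentioned above, where seekCeil can skip
over multiple query terms at a time, here's a second toy sketch. Again, this
is only a model built on java.util.TreeSet, not Lucene's TermsEnum, and the
class name and seek counter are hypothetical. It assumes the query terms are
sorted and de-duplicated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model only: a TreeSet stands in for the term dictionary. */
final class SeekCeilSkipSketch {

  /** Number of ceiling seeks issued by the most recent call to intersect. */
  static int seeks;

  /** Intersects sorted query terms with the dictionary, skipping terms a seek has ruled out. */
  static List<String> intersect(TreeSet<String> dictionary, List<String> sortedQueryTerms) {
    seeks = 0;
    List<String> matches = new ArrayList<>();
    int i = 0;
    while (i < sortedQueryTerms.size()) {
      seeks++;
      String found = dictionary.ceiling(sortedQueryTerms.get(i));
      if (found == null) {
        break; // no dictionary term at or after this query term
      }
      // Every query term that sorts before the term we landed on cannot match.
      while (i < sortedQueryTerms.size() && sortedQueryTerms.get(i).compareTo(found) < 0) {
        i++;
      }
      if (i < sortedQueryTerms.size() && sortedQueryTerms.get(i).equals(found)) {
        matches.add(found);
        i++;
      }
    }
    return matches;
  }
}
```

With a dictionary of {m, z} and query terms [a, b, c, d, m], a single ceiling
seek in this model lands on m and rules out a through d, whereas a seekExact
loop without a bloom filter would probe the dictionary once per query term.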

Cheers,
-Greg
