Hi folks-

Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery can leverage the various "rewrite methods" available to MultiTermQuery, letting users customize the query evaluation strategy (e.g., postings vs. doc values), which was a nice win. In the benchmarks we ran, we didn't see any performance issues.
In anticipation of the 9.6 release, I've pulled this change into the Lucene snapshot we use for Amazon product search and started running some additional benchmarks, which have surfaced an interesting issue. One use-case we have for TermInSetQuery creates a term disjunction over a field that uses bloom filtering (i.e., BloomFilteringPostingsFormat). Because bloom filtering can only help with seekExact and not seekCeil, we're seeing a performance regression (primarily in red-line QPS).

One way I can think of to address this is to move back to a seekExact approach when creating the filtered TermsEnum used by MultiTermQuery (for the TermInSetQuery implementation). Because TermInSetQuery can provide all of its terms up-front, we can have a simpler term intersection implementation that relies on seekExact instead of seekCeil. Here's a quick take on what I'm thinking: https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5. I've tested this internally and confirmed it solves our QPS regression.

I'm curious whether anyone objects to moving back to a seekExact term intersection approach for TermInSetQuery, or has alternative ideas. I wonder if I'm overlooking some important factors and focusing too much on this specific case, where the bloom filter interaction is hurting performance. It seems like seekCeil could beat seekExact in some cases by skipping over multiple query terms at a time, so that's a possible consideration. If we solve for the most common cases by default, I suppose advanced users could always override TermInSetQuery#getTermsEnum as necessary (for example, we could take this approach internally to work with our bloom filtering if the best default turns out to be seekCeil).

I can easily turn my quick solution into a PR, but before I do, I wanted to poll this group for thoughts on the approach or on alternatives I might be overlooking.

Thanks in advance!

Cheers,
-Greg
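
P.S. To make the seekExact vs. seekCeil trade-off above a bit more concrete, here's a toy sketch in plain Java. This is not Lucene code and not what the linked commit does: SeekSketch, Result, and both method names are made up for illustration, a sorted TreeSet-style set stands in for a segment's terms dictionary, and a plain Set stands in for the bloom filter (so, unlike a real bloom filter, it has no false positives). It only counts how often each strategy has to consult the "dictionary".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.Set;

/** Toy model only: hypothetical names, no Lucene APIs involved. */
final class SeekSketch {

  /** Matched terms plus the number of times the terms dictionary was consulted. */
  static final class Result {
    final List<String> matches = new ArrayList<>();
    int dictionaryLookups = 0;
  }

  /**
   * seekExact-style intersection: one point lookup per query term. A membership
   * filter (stand-in for a bloom filter; pass null for "no filter") can veto the
   * lookup before the dictionary is touched at all.
   */
  static Result intersectWithSeekExact(
      NavigableSet<String> dictionary, Set<String> filter, List<String> sortedQueryTerms) {
    Result result = new Result();
    for (String term : sortedQueryTerms) {
      if (filter != null && !filter.contains(term)) {
        continue; // filter says "definitely absent": skip the dictionary entirely
      }
      result.dictionaryLookups++;
      if (dictionary.contains(term)) {
        result.matches.add(term);
      }
    }
    return result;
  }

  /**
   * seekCeil-style intersection: ask the dictionary for the smallest term >= the
   * current query term. On a miss, the returned term lets us skip every query term
   * that sorts before it, but a membership filter cannot help with this kind of seek.
   */
  static Result intersectWithSeekCeil(
      NavigableSet<String> dictionary, List<String> sortedQueryTerms) {
    Result result = new Result();
    int i = 0;
    while (i < sortedQueryTerms.size()) {
      String term = sortedQueryTerms.get(i);
      result.dictionaryLookups++;
      String ceil = dictionary.ceiling(term);
      if (ceil == null) {
        break; // dictionary exhausted: no later query term can match
      }
      if (ceil.equals(term)) {
        result.matches.add(term);
        i++;
      } else {
        // Miss: advance past every query term smaller than the term we landed on.
        i++;
        while (i < sortedQueryTerms.size() && sortedQueryTerms.get(i).compareTo(ceil) < 0) {
          i++;
        }
      }
    }
    return result;
  }
}
```

With mostly-missing query terms, the filtered seekExact path touches the dictionary only for terms the filter lets through, while the seekCeil path wins its lookups back only when a miss happens to skip a run of query terms. Which effect dominates depends on the data, which is really the question I'm asking above.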