Hi Greg,

IMO seekCeil is still the better solution for the default postings format, as it could potentially save time traversing the FST by doing the ping-pong skipping. I can see that with a bloom filter seekExact might be better, but I'm not sure whether there is a better way than overriding `getTermsEnum`...
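To make the "ping-pong skipping" concrete, here is a rough, self-contained sketch. It is a toy model, not Lucene code: a `TreeSet<String>` stands in for the segment's sorted term dictionary, `contains` stands in for `seekExact`, `ceiling` stands in for `seekCeil`, and the class and method names (`SeekSketch`, `intersectExact`, `intersectCeil`) are made up for illustration. It only counts seeks; it says nothing about the relative cost of a single seekCeil versus a single seekExact in a real postings format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model: a TreeSet stands in for a segment's sorted term dictionary. */
class SeekSketch {

  /** seekExact style: one point lookup per query term. seeks[0] counts lookups. */
  static List<String> intersectExact(TreeSet<String> dict, List<String> sortedQuery, int[] seeks) {
    List<String> hits = new ArrayList<>();
    for (String q : sortedQuery) {
      seeks[0]++;
      if (dict.contains(q)) { // stand-in for a seekExact call
        hits.add(q);
      }
    }
    return hits;
  }

  /** seekCeil style: each miss tells us the next existing term, so we can skip query terms. */
  static List<String> intersectCeil(TreeSet<String> dict, List<String> sortedQuery, int[] seeks) {
    List<String> hits = new ArrayList<>();
    int i = 0;
    while (i < sortedQuery.size()) {
      String q = sortedQuery.get(i);
      seeks[0]++;
      String ceil = dict.ceiling(q); // stand-in for a seekCeil call
      if (ceil == null) {
        break; // no dictionary term >= q, so no later query term can match either
      }
      if (ceil.equals(q)) {
        hits.add(q);
        i++;
      } else {
        // "Pong": skip every query term that sorts before the term we landed on.
        i++;
        while (i < sortedQuery.size() && sortedQuery.get(i).compareTo(ceil) < 0) {
          i++;
        }
      }
    }
    return hits;
  }
}
```

With a dictionary of {b, x} and sorted query terms a, b, c, d, e, f, x, both methods return [b, x], but the seekExact loop does 7 lookups while the seekCeil loop does 4, because the miss on c lands on x and lets it skip d, e and f.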
Patrick

On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote:

> Hi folks-
>
> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we
> rewrote TermInSetQuery to extend MultiTermQuery. With this change,
> TermInSetQuery can now leverage the various "rewrite methods" available to
> MultiTermQuery, allowing users to customize the query evaluation strategy
> (e.g., postings vs. doc values, etc.), which was a nice win. In the
> benchmarks we ran, we didn't see any performance issues.
>
> In anticipation of 9.6 releasing, I've pulled this change into the Lucene
> snapshot we use for Amazon product search, and started running some
> additional benchmarks, which have surfaced an interesting issue. One
> use-case we have for TermInSetQuery creates a term disjunction over a field
> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
> bloom filtering can only help with seekExact and not seekCeil, we're seeing
> a performance regression (primarily in red-line QPS).
>
> One way I can think to address this is to move back to a seekExact
> approach when creating the filtered TermsEnum used by MultiTermQuery (for
> the TermInSetQuery implementation). Because TermInSetQuery can provide all
> of its terms up-front, we can have a simpler term intersection
> implementation that relies on seekExact over seekCeil. Here's a quick take
> on what I'm thinking:
> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
> I've tested this internally and confirmed it solves our QPS regression
> problem.
>
> I'm curious if anyone has an objection to moving back to a seekExact term
> intersection approach for TermInSetQuery, or has alternative ideas. I
> wonder if I'm overlooking some important factors and focusing too much on
> this specific case where the bloom filter interaction is hurting
> performance? It seems like seekCeil could provide benefits in some cases
> over seekExact by skipping over multiple query terms at a time, so that's a
> possible consideration. If we solve for the most common cases by default, I
> suppose advanced users could always override TermInSetQuery#getTermsEnum as
> necessary (we could take this approach internally for example to work with
> our bloom filtering if the best default is to leverage seekCeil). I can
> easily turn my quick solution into a PR, but before I do, I wanted to poll
> this group for thoughts on the approach or other alternatives I might be
> overlooking. Thanks in advance!
>
> Cheers,
> -Greg
>
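
For anyone following along, the point above that bloom filtering can only help seekExact can be illustrated with a small stand-alone sketch. Everything here is hypothetical and simplified: `BloomGuardedTerms` is a made-up class, the two hash functions are arbitrary, and a `TreeSet<String>` stands in for the real term dictionary. It is not how the bloom postings format in Lucene is implemented, just the general idea that a membership filter can answer "is this exact term present?" but has nothing to say about "what is the next term at or after this one?".

```java
import java.util.BitSet;
import java.util.TreeSet;

/** Toy term dictionary with a membership filter in front of exact lookups only. */
class BloomGuardedTerms {
  private static final int NUM_BITS = 1024;

  private final TreeSet<String> dict = new TreeSet<>();
  private final BitSet bits = new BitSet(NUM_BITS);

  /** Counts how often the (expensive) dictionary itself is consulted. */
  int dictLookups = 0;

  // Two arbitrary hash functions, chosen only for illustration.
  private static int h1(String term) {
    return Math.floorMod(term.hashCode(), NUM_BITS);
  }

  private static int h2(String term) {
    return Math.floorMod(term.hashCode() * 31 + 17, NUM_BITS);
  }

  void add(String term) {
    dict.add(term);
    bits.set(h1(term));
    bits.set(h2(term));
  }

  /** Exact lookup: the filter can rule a term out without touching the dictionary. */
  boolean seekExact(String term) {
    if (!bits.get(h1(term)) || !bits.get(h2(term))) {
      return false; // definitely absent, dictionary untouched
    }
    dictLookups++; // possibly present (or a false positive): must check for real
    return dict.contains(term);
  }

  /** Ceiling lookup: the filter cannot help, so the dictionary is always consulted. */
  String seekCeil(String term) {
    dictLookups++;
    return dict.ceiling(term);
  }
}
```

In this toy, after adding "b" and "x", `seekExact("a")` returns false without touching the dictionary, whereas `seekCeil("a")` always has to go to the dictionary to find "b".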