The better solution is to use Terms.intersect. Then the postings format can do the right thing. But this query doesn't use Terms.intersect today; it does the ping-ponging itself.
That's the problem. We must *not* tune our algorithms for Amazon's search, but instead for what is best for users (the default postings format).

On Fri, May 5, 2023 at 9:34 PM Patrick Zhai <zhai7...@gmail.com> wrote:
>
> Hi Greg
> IMO I still think the seekCeil is a better solution for the default posting
> format, as it could potentially save time on traversing the FST by doing the
> ping-pong skipping.
> I can see that in the case of using bloom filter the seekExact might be
> better but I'm not sure whether there is a better way than overriding the
> `getTermsEnum`...
>
> Patrick
>
> On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote:
>>
>> Hi folks-
>>
>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote
>> TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery
>> can now leverage the various "rewrite methods" available to MultiTermQuery,
>> allowing users to customize the query evaluation strategy (e.g., postings
>> vs. doc values, etc.), which was a nice win. In the benchmarks we ran, we
>> didn't see any performance issues.
>>
>> In anticipation of 9.6 releasing, I've pulled this change into the Lucene
>> snapshot we use for Amazon product search, and started running some
>> additional benchmarks, which have surfaced an interesting issue. One
>> use-case we have for TermInSetQuery creates a term disjunction over a field
>> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
>> bloom filtering can only help with seekExact and not seekCeil, we're seeing
>> a performance regression (primarily in red-line QPS).
>>
>> One way I can think to address this is to move back to a seekExact approach
>> when creating the filtered TermsEnum used by MultiTermQuery (for the
>> TermInSetQuery implementation). Because TermInSetQuery can provide all of
>> its terms up-front, we can have a simpler term intersection implementation
>> that relies on seekExact over seekCeil.
>> Here's a quick take on what I'm thinking:
>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
>> I've tested this internally and confirmed it solves our QPS regression
>> problem.
>>
>> I'm curious if anyone has an objection to moving back to a seekExact term
>> intersection approach for TermInSetQuery, or has alternative ideas. I wonder
>> if I'm overlooking some important factors and focusing too much on this
>> specific case where the bloom filter interaction is hurting performance? It
>> seems like seekCeil could provide benefits in some cases over seekExact by
>> skipping over multiple query terms at a time, so that's a possible
>> consideration. If we solve for the most common cases by default, I suppose
>> advanced users could always override TermInSetQuery#getTermsEnum as
>> necessary (we could take this approach internally for example to work with
>> our bloom filtering if the best default is to leverage seekCeil). I can
>> easily turn my quick solution into a PR, but before I do, I wanted to poll
>> this group for thoughts on the approach or other alternatives I might be
>> overlooking. Thanks in advance!
>>
>> Cheers,
>> -Greg
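
For readers following the seekExact versus seekCeil trade-off in this thread, here is a minimal, self-contained sketch of the two term-intersection strategies. It is not Lucene code and does not use the Lucene API: `ToyTermDict` is a hypothetical stand-in for a terms dictionary, backed by a `java.util.TreeSet`, and its `seekExact`/`seekCeil` methods only mimic the general idea of the real `TermsEnum` calls. It does not model FST traversal cost or bloom filtering; it only counts how many seeks each strategy issues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class TermIntersectionSketch {

    /** Hypothetical toy terms dictionary (NOT a Lucene class); counts seeks. */
    static final class ToyTermDict {
        private final TreeSet<String> terms;
        int seekExactCalls = 0;
        int seekCeilCalls = 0;

        ToyTermDict(List<String> indexedTerms) {
            this.terms = new TreeSet<>(indexedTerms);
        }

        /** Returns true only if the exact term is present. */
        boolean seekExact(String target) {
            seekExactCalls++;
            return terms.contains(target);
        }

        /** Returns the smallest indexed term >= target, or null if none. */
        String seekCeil(String target) {
            seekCeilCalls++;
            return terms.ceiling(target);
        }
    }

    /** One exact lookup per query term; a membership filter could short-circuit each one. */
    static List<String> intersectWithSeekExact(ToyTermDict dict, List<String> sortedQueryTerms) {
        List<String> matches = new ArrayList<>();
        for (String q : sortedQueryTerms) {
            if (dict.seekExact(q)) {
                matches.add(q);
            }
        }
        return matches;
    }

    /** "Ping-pong": seek the dictionary, then skip query terms below the term it landed on. */
    static List<String> intersectWithSeekCeil(ToyTermDict dict, List<String> sortedQueryTerms) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < sortedQueryTerms.size()) {
            String q = sortedQueryTerms.get(i);
            String found = dict.seekCeil(q);
            if (found == null) {
                break; // no indexed term at or after q, so no later query term can match
            }
            if (found.equals(q)) {
                matches.add(q);
                i++;
            } else {
                // Skip every query term smaller than the indexed term we landed on.
                while (i < sortedQueryTerms.size()
                        && sortedQueryTerms.get(i).compareTo(found) < 0) {
                    i++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> indexed = List.of("apple", "mango", "zebra");
        List<String> query = List.of("apple", "bear", "cat", "dog", "mango", "zebra");

        ToyTermDict exactDict = new ToyTermDict(indexed);
        ToyTermDict ceilDict = new ToyTermDict(indexed);

        System.out.println("seekExact matches: " + intersectWithSeekExact(exactDict, query)
                + ", seeks: " + exactDict.seekExactCalls);
        System.out.println("seekCeil matches:  " + intersectWithSeekCeil(ceilDict, query)
                + ", seeks: " + ceilDict.seekCeilCalls);
    }
}
```

In this toy setup the seekCeil loop issues fewer seeks when runs of query terms fall between indexed terms, which is the skipping benefit mentioned above. The seekExact loop always issues one lookup per query term, but each lookup is a pure membership test, which is the kind of call a bloom filter can answer cheaply. Which one wins in practice depends on the postings format and the data, so treat this only as an illustration of the shape of the two loops.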