Hi Greg
IMO I still think the seekCeil is a better solution for the default posting
format, as it could potentially save time on traversing the FST by doing
the ping-pong skipping.
I can see that in the case of using bloom filter the seekExact might be
better but I'm not sure whether there is a better way than overriding the
`getTermsEnum`...

Patrick

On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote:

> Hi folks-
>
> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we
> rewrote TermInSetQuery to extend MultiTermQuery. With this change,
> TermInSetQuery can now leverage the various "rewrite methods" available to
> MultiTermQuery, allowing users to customize the query evaluation strategy
> (e.g., postings vs. doc values, etc.), which was a nice win. In the
> benchmarks we ran, we didn't see any performance issues.
>
> In anticipation of 9.6 releasing, I've pulled this change into the Lucene
> snapshot we use for Amazon product search, and started running some
> additional benchmarks, which have surfaced an interesting issue. One
> use-case we have for TermInSetQuery creates a term disjunction over a field
> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
> bloom filtering can only help with seekExact and not seekCeil, we're seeing
> a performance regression (primarily in red-line QPS).
>
> One way I can think to address this is to move back to a seekExact
> approach when creating the filtered TermsEnum used by MultiTermQuery (for
> the TermInSetQuery implementation). Because TermInSetQuery can provide all
> of its terms up-front, we can have a simpler term intersection
> implementation that relies on seekExact over seekCeil. Here's a quick take
> on what I'm thinking:
> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
> I've tested this internally and confirmed it solves our QPS regression
> problem.
>
> I'm curious if anyone has an objection to moving back to a seekExact term
> intersection approach for TermInSetQuery, or has alternative ideas. I
> wonder if I'm overlooking some important factors and focusing too much on
> this specific case where the bloom filter interaction is hurting
> performance? It seems like seekCeil could provide benefits in some cases
> over seekExact by skipping over multiple query terms at a time, so that's a
> possible consideration. If we solve for the most common cases by default, I
> suppose advanced users could always override TermInSetQuery#getTermsEnum as
> necessary (we could take this approach internally for example to work with
> our bloom filtering if the best default is to leverage seekCeil). I can
> easily turn my quick solution into a PR, but before I do, I wanted to poll
> this group for thoughts on the approach or other alternatives I might be
> overlooking. Thanks in advance!
>
> Cheers,
> -Greg
>

Reply via email to