The better solution is to use Terms.intersect, so that the postings
format can do the right thing. But this query doesn't use
Terms.intersect today; instead it does the ping-ponging itself.

That's the problem.

We must *not* tune our algorithms for Amazon's search, but instead for
what is best for users (the default postings format).
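
To make the trade-off concrete, here is a toy sketch of the two seek
strategies discussed in this thread. It does not use Lucene's API: a
java.util.TreeSet stands in for the terms dictionary, and the class name,
method names, and lookup counter are all hypothetical. It ignores FST
traversal cost and bloom filters entirely, so it only illustrates how a
seekCeil-style ping-pong can skip several query terms in one lookup, while
a seekExact-style loop always does one lookup per query term (the kind of
point lookup a bloom filter can reject cheaply).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only: TreeSet plays the role of a sorted terms dictionary.
class SeekSketch {
    // Number of dictionary lookups performed; reset by the caller.
    static int lookups = 0;

    // seekExact-style: one point lookup per query term.
    static List<String> intersectExact(TreeSet<String> dict, TreeSet<String> query) {
        List<String> hits = new ArrayList<>();
        for (String q : query) {
            lookups++;
            if (dict.contains(q)) {
                hits.add(q);
            }
        }
        return hits;
    }

    // seekCeil-style ping-pong: after a miss, jump the query side forward
    // to the first query term >= the dictionary term we landed on.
    static List<String> intersectCeil(TreeSet<String> dict, TreeSet<String> query) {
        List<String> hits = new ArrayList<>();
        String q = query.isEmpty() ? null : query.first();
        while (q != null) {
            lookups++;
            String t = dict.ceiling(q);   // smallest dictionary term >= q
            if (t == null) {
                break;                    // dictionary exhausted
            }
            if (t.equals(q)) {
                hits.add(q);
                q = query.higher(q);      // advance to the next query term
            } else {
                q = query.ceiling(t);     // skip every query term < t
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> dict = new TreeSet<>(List.of("apple", "mango", "zebra"));
        TreeSet<String> query = new TreeSet<>(List.of("banana", "cherry", "date", "mango"));

        lookups = 0;
        System.out.println(intersectExact(dict, query) + " lookups=" + lookups);
        lookups = 0;
        System.out.println(intersectCeil(dict, query) + " lookups=" + lookups);
    }
}
```

With this sample data, both methods find the same single match, but the
ping-pong version needs fewer dictionary lookups because one ceiling call
skips three absent query terms at once. Which strategy wins in a real index
depends on the postings format.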

On Fri, May 5, 2023 at 9:34 PM Patrick Zhai <zhai7...@gmail.com> wrote:
>
> Hi Greg
> IMO I still think seekCeil is a better solution for the default postings
> format, as it could potentially save time traversing the FST by doing the
> ping-pong skipping.
> I can see that in the case of using bloom filter the seekExact might be 
> better but I'm not sure whether there is a better way than overriding the 
> `getTermsEnum`...
>
> Patrick
>
> On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote:
>>
>> Hi folks-
>>
>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote 
>> TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery 
>> can now leverage the various "rewrite methods" available to MultiTermQuery, 
>> allowing users to customize the query evaluation strategy (e.g., postings 
>> vs. doc values, etc.), which was a nice win. In the benchmarks we ran, we 
>> didn't see any performance issues.
>>
>> In anticipation of 9.6 releasing, I've pulled this change into the Lucene 
>> snapshot we use for Amazon product search, and started running some 
>> additional benchmarks, which have surfaced an interesting issue. One 
>> use-case we have for TermInSetQuery creates a term disjunction over a field 
>> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because 
>> bloom filtering can only help with seekExact and not seekCeil, we're seeing 
>> a performance regression (primarily in red-line QPS).
>>
>> One way I can think to address this is to move back to a seekExact approach 
>> when creating the filtered TermsEnum used by MultiTermQuery (for the 
>> TermInSetQuery implementation). Because TermInSetQuery can provide all of 
>> its terms up-front, we can have a simpler term intersection implementation 
>> that relies on seekExact over seekCeil. Here's a quick take on what I'm 
>> thinking:
>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5
>> I've tested this internally and confirmed it solves our QPS regression
>> problem.
>>
>> I'm curious if anyone has an objection to moving back to a seekExact term 
>> intersection approach for TermInSetQuery, or has alternative ideas. I wonder 
>> if I'm overlooking some important factors and focusing too much on this 
>> specific case where the bloom filter interaction is hurting performance? It 
>> seems like seekCeil could provide benefits in some cases over seekExact by 
>> skipping over multiple query terms at a time, so that's a possible 
>> consideration. If we solve for the most common cases by default, I suppose 
>> advanced users could always override TermInSetQuery#getTermsEnum as 
>> necessary (we could take this approach internally for example to work with 
>> our bloom filtering if the best default is to leverage seekCeil). I can 
>> easily turn my quick solution into a PR, but before I do, I wanted to poll 
>> this group for thoughts on the approach or other alternatives I might be 
>> overlooking. Thanks in advance!
>>
>> Cheers,
>> -Greg

