Re: Question about ImpactsDISI for boolean queries

Alfonsi, Peter Mon, 21 Apr 2025 17:30:27 -0700

Hi Adrien, 

Thanks for the quick reply! This makes sense. I think 
BlockMaxConjunctionBulkScorer actually never calls setMinCompetitiveScore() at 
all, so there's no hope of skipping, while ConjunctionScorer does in the case 
that there's only one scorer (which happens when we move the range query to 
FILTER).

This week I'll try and make some Lucene changes doing what you described, and 
see if I can show a speedup in a Lucene benchmark rather than OpenSearch 
Benchmarks. I'm a bit unsure about rewriting all numeric MUST clauses to 
FILTER, since I did see slowdowns in some cases, but it's possible this is just 
due to https://github.com/apache/lucene/issues/14445 and in this case I should 
be able to confirm it easily by building with the draft fix and seeing if the 
slowdowns go away. 

Would definitely appreciate help with reviews, thanks for offering.
--Peter

On 4/21/25, 1:17 PM, "Adrien Grand" <jpou...@gmail.com 
<mailto:jpou...@gmail.com>> wrote:

CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.

You are on the right track.

It's easier to skip by score when there is a single scoring clause than
when the score is the sum of the scores of two clauses. Well, actually in
this case two clauses are not much harder since one of the clauses gives
the same score to all documents, but the conjunction scorer doesn't
know this so it doesn't take advantage of it.

Plus, when you have two clauses and the constant-scoring query (the range
query in your case) is leading (ie. has a lower cost), it also drives
impacts, and since its Scorer#advanceShallow returns MAX_VALUE, then the
whole segment is treated as a single block, which reduces opportunities for
skipping by a lot compared with smaller blocks.

Propagating setTopLevelScoringClause to conjunctive clauses alone did not
help, you would also need the conjunction scorer to propagate
setMinCompetitiveScore if you wanted it to help, but in my opinion, the
easier/best fix to improve this case would be to rewrite MUST clauses that
produce constant scores as FILTER clauses, and wrap the parent BooleanQuery
in a query wrapper that adds the score of this constant-scoring clause to
all hits. Then your two queries would evaluate the same way in practice.
I'm happy to help with reviews if you'd like to give it a try.

On Mon, Apr 21, 2025 at 7:02 PM Alfonsi, Peter <petea...@amazon.com.inva 
<mailto:petea...@amazon.com.inva>lid>
wrote:

> Hello,
>
> I’ve been working on optimizing boolean queries in OpenSearch, using
> Lucene 10.1. For these tests I’ve been using our http_logs benchmark, which
> has a date field called “@timestamp”, an integer field called “status”, and
> a text field called “request”. I noticed there’s sometimes a large speedup
> when we move a numeric MUST clause to a FILTER clause, and I was wondering
> if I could get some help on figuring out why. For example, this original
> query takes ~1500 ms:
>
> "bool": {
> "must": [
> {"match": {"request": "images"}},
> {"range": {"@timestamp": {
> "gte": "1998-06-10T00:00:00Z",
> "lt": "1998-06-13T00:00:00Z"
> }}}
> ]
> }
> But when we rewrite the query so the range query is a FILTER clause, it
> speeds up to ~40 ms:
>
> "bool": {
> "must": {"match": {"request": "images"}},
> "filter": {
> "range": {"@timestamp": {
> "gte": "1998-06-10T00:00:00Z",
> "lt": "1998-06-13T00:00:00Z"
> }}
> }
> }
> This speedup seems to happen for all queries where the leading (cheapest)
> iterator in the conjunction is the numeric one, which ultimately wraps a
> BitSetIterator. If the text query is leading, we see no speedup, or
> sometimes a slowdown.
>
> The speedup is not because we save time scoring the range query, which was
> my initial motivation for trying this. The range query uses a
> ConstantScoreScorer, and it seems to be so fast that I’ve never seen any
> samples of it when taking a flamegraph.
>
> My next assumption was that, by having only one MUST clause, we were
> enabling use of ImpactsDISI, which was skipping lots of non-competitive doc
> IDs. TermScorer only uses ImpactsDISI when it’s the top-level scoring
> clause. BooleanScorerSupplier.setTopLevelScoringClause() only propagates
> the call of setTopLevelScoringClause() to its clauses when (must_clauses +
> should_clauses) = 1.
>
> Supporting this evidence, I saw ImpactsDISI only appeared in the
> flamegraphs for the rewritten query. When I profiled the two queries using
> OpenSearch’s profile API, I saw in the rewritten query, both the term query
> and range query called advance() way less than in the original (23M → 1500
> calls for the range query, 21M → 12k calls for the term query), and this
> was similar for the number of calls to score() . Finally I made a mock
> version of the dataset and locally ran queries while using IntelliJ’s
> debugger, and I was able to see ImpactDISI in the rewritten query skipping
> 100s or 1000s of doc IDs at a time. So, I figured the rewritten query was
> so much faster because we skip so many documents.
>
> To test this I built OpenSearch with a modified test version of Lucene
> where BooleanScorerSupplier propagated setTopLevelScoringClause() to its
> clauses in both cases. I assumed I’d see both the original and rewritten
> queries skipping doc IDs and taking ~40 ms. But there was no speedup for
> the MUST query. From flamegraphs, I saw it was indeed now using ImpactsDISI
> as I expected. When I profiled the queries, I saw the number of advance()
> calls was still 21M, not 12k. So it seems we aren’t skipping doc IDs in
> this case. (Unfortunately I can’t debug with the modified Lucene to confirm
> this). From my understanding of ImpactsDISI, it should work the same. Do
> you know why this might be behaving differently when it’s in the MUST
> clause?
>
> Alternatively, am I completely on the wrong track? Maybe something else
> besides ImpactsDISI is allowing us to skip large numbers of calls to
> advance() in the rewritten query, so that enabling it for the original
> query didn’t actually do anything.
>
> Sorry for the long description - I really appreciate the help! :)
>
> Thanks,
> Peter
>
>

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Question about ImpactsDISI for boolean queries

Reply via email to