Hello,

I’ve been working on optimizing boolean queries in OpenSearch, using Lucene 
10.1. For these tests I’ve been using our http_logs benchmark, which has a date 
field called “@timestamp”, an integer field called “status”, and a text field 
called “request”. I noticed there’s sometimes a large speedup when we move a 
numeric MUST clause to a FILTER clause, and I was wondering if I could get some 
help figuring out why. For example, this original query takes ~1500 ms:

"bool": {
    "must": [
        {"match": {"request": "images"}},
        {"range": {"@timestamp": {
            "gte": "1998-06-10T00:00:00Z",
            "lt": "1998-06-13T00:00:00Z"
        }}}
    ]
}
But when we rewrite the query so the range query is a FILTER clause, it speeds 
up to ~40 ms:

"bool": {
    "must": {"match": {"request": "images"}},
    "filter": {
        "range": {"@timestamp": {
            "gte": "1998-06-10T00:00:00Z",
            "lt": "1998-06-13T00:00:00Z"
        }}
    }
}
This speedup seems to happen for all queries where the leading (cheapest) 
iterator in the conjunction is the numeric one, which ultimately wraps a 
BitSetIterator. If the text query is leading, we see no speedup, or sometimes a 
slowdown.
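
To illustrate why the leading iterator matters, here is a rough Python sketch 
of a leapfrog conjunction over sorted doc-ID lists. This is a toy model, not 
Lucene’s actual DocIdSetIterator/ConjunctionDISI code: the lead iterator 
proposes candidate docs, and the other clauses either confirm them or push the 
candidate forward, so the choice of lead drives how advance() calls are 
distributed.

```python
class DocIdIterator:
    """Toy model of an iterator over a sorted doc-ID list
    (not Lucene's DocIdSetIterator API)."""
    def __init__(self, docs):
        self.docs = sorted(docs)
        self.pos = -1
        self.advance_calls = 0  # instrumentation, like the profile API's counts

    def advance(self, target):
        # Move to the first doc >= target (linear scan for clarity).
        self.advance_calls += 1
        self.pos = max(self.pos, 0)
        while self.pos < len(self.docs) and self.docs[self.pos] < target:
            self.pos += 1
        return self.docs[self.pos] if self.pos < len(self.docs) else float("inf")

def conjunction(lead, others):
    """Leapfrog intersection: lead proposes, others confirm or advance."""
    results = []
    doc = lead.advance(0)
    while doc != float("inf"):
        for it in others:
            other = it.advance(doc)
            if other != doc:
                # Mismatch: jump the lead forward past the other's position.
                doc = lead.advance(other)
                break
        else:
            results.append(doc)
            doc = lead.advance(doc + 1)
    return results
```

For example, intersecting [1, 3, 5, 7, 9] (lead) with [3, 4, 7, 8] yields 
[3, 7], with the lead absorbing most of the advance() calls.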

The speedup is not because we save time scoring the range query, which was my 
initial motivation for trying this. The range query uses a ConstantScoreScorer, 
which seems to be so cheap that I’ve never seen any samples of it in a 
flamegraph.

My next hypothesis was that, by having only one MUST clause, we were enabling 
use of ImpactsDISI, which skips lots of non-competitive doc IDs. TermScorer 
only uses ImpactsDISI when it’s the top-level scoring clause, and 
BooleanScorerSupplier.setTopLevelScoringClause() only propagates the call to 
its clauses when (must_clauses + should_clauses) == 1 — which is true for the 
rewritten query but not the original.
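
For intuition, here is a conceptual Python sketch of the block-max idea behind 
ImpactsDISI. It assumes a toy layout of (doc_ids, max_score) blocks, which is 
not Lucene’s actual impacts format: when a block’s maximum possible score 
can’t beat the current minimum competitive score, the whole block is skipped 
without advancing doc by doc.

```python
def skip_non_competitive(blocks, min_competitive_score):
    """Toy model of block-max skipping (not Lucene's ImpactsDISI code).

    blocks: list of (doc_ids, max_score) pairs, where max_score is the
    best score any doc in that block could achieve.
    Yields only docs from blocks that could still be competitive."""
    for doc_ids, max_score in blocks:
        if max_score <= min_competitive_score:
            continue  # skip the entire block in one step
        yield from doc_ids
```

With blocks [([1, 2], 0.5), ([3, 4], 2.0), ([5], 0.3)] and a threshold of 1.0, 
only [3, 4] survive — the hundreds or thousands of doc IDs skipped at a time 
in the debugger correspond to these whole-block skips.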

Supporting this hypothesis, ImpactsDISI only appeared in the flamegraphs for 
the rewritten query. When I profiled the two queries using OpenSearch’s 
profile API, I saw that in the rewritten query both the term query and the 
range query called advance() far less than in the original (23M → 1,500 calls 
for the range query, 21M → 12k calls for the term query), and the number of 
calls to score() dropped similarly. Finally, I made a mock version of the 
dataset and ran queries locally under IntelliJ’s debugger, where I could see 
ImpactsDISI in the rewritten query skipping 100s or 1000s of doc IDs at a 
time. So, I figured the rewritten query was so much faster because we skip so 
many documents.

To test this, I built OpenSearch with a modified test version of Lucene where 
BooleanScorerSupplier propagated setTopLevelScoringClause() to its clauses in 
both cases. I assumed I’d see both the original and rewritten queries skipping 
doc IDs and taking ~40 ms. But there was no speedup for the MUST query. From 
flamegraphs, I saw it was indeed now using ImpactsDISI, as I expected. But 
when I profiled the queries, the number of advance() calls was still 21M, not 
12k, so it seems we aren’t skipping doc IDs in this case. (Unfortunately I 
can’t debug with the modified Lucene to confirm this.) From my understanding 
of ImpactsDISI, it should behave the same either way. Do you know why it might 
behave differently when the range query is a MUST clause?

Alternatively, am I completely on the wrong track? Maybe something else besides 
ImpactsDISI is allowing us to skip large numbers of calls to advance() in the 
rewritten query, so that enabling it for the original query didn’t actually do 
anything.

Sorry for the long description - I really appreciate the help! :)

Thanks,
Peter
