Hello,
I’ve been working on optimizing boolean queries in OpenSearch, using Lucene
10.1. For these tests I’ve been using our http_logs benchmark, which has a date
field called “@timestamp”, an integer field called “status”, and a text field
called “request”. I noticed there’s sometimes a large speedup when we move a
numeric MUST clause to a FILTER clause, and I was hoping to get some help
figuring out why. For example, this original query takes ~1500 ms:
"bool": {
  "must": [
    {"match": {"request": "images"}},
    {"range": {"@timestamp": {
      "gte": "1998-06-10T00:00:00Z",
      "lt": "1998-06-13T00:00:00Z"
    }}}
  ]
}
But when we rewrite the query so the range query is a FILTER clause, it speeds
up to ~40 ms:
"bool": {
  "must": {"match": {"request": "images"}},
  "filter": {
    "range": {"@timestamp": {
      "gte": "1998-06-10T00:00:00Z",
      "lt": "1998-06-13T00:00:00Z"
    }}
  }
}
This speedup seems to happen for all queries where the leading (cheapest)
iterator in the conjunction is the numeric one, which ultimately wraps a
BitSetIterator. If the text query is leading, we see no speedup, or sometimes a
slowdown.
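For reference, the leapfrog pattern I mean looks roughly like this. This is a
simplified, self-contained sketch, not Lucene’s actual ConjunctionDISI, and the
doc IDs are made up:

```java
import java.util.*;

// Toy model of a leapfrog conjunction (not Lucene's real ConjunctionDISI):
// the "lead" iterator proposes a doc ID and the other clause advances to it;
// counting advance() calls shows how the choice of lead drives the work done.
public class Leapfrog {
    static int advanceCalls = 0;

    // Advance an iterator (sorted doc IDs + current position) to the first
    // doc >= target; returns Integer.MAX_VALUE when exhausted.
    static int advance(int[] docs, int[] pos, int target) {
        advanceCalls++;
        while (pos[0] < docs.length && docs[pos[0]] < target) pos[0]++;
        return pos[0] < docs.length ? docs[pos[0]] : Integer.MAX_VALUE;
    }

    // Leapfrog: the lead proposes a candidate, the other clause either
    // confirms it or raises the target for the lead.
    static List<Integer> conjunction(int[] lead, int[] other) {
        List<Integer> hits = new ArrayList<>();
        int[] lp = {0}, op = {0};
        int doc = advance(lead, lp, 0);
        while (doc != Integer.MAX_VALUE) {
            int otherDoc = advance(other, op, doc);
            if (otherDoc == doc) {
                hits.add(doc);
                doc = advance(lead, lp, doc + 1);
            } else {
                doc = advance(lead, lp, otherDoc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] range = {5, 6, 7, 8, 9};          // dense, narrow "date range"
        int[] term = {2, 7, 50, 90, 200, 500};  // sparse "term" postings
        System.out.println(conjunction(range, term)); // prints [7]
        System.out.println("advance() calls: " + advanceCalls);
    }
}
```

The point is just that the lead iterator determines how many candidates the
other clauses have to advance() past, which is why which clause leads matters.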
The speedup is not because we save time scoring the range query, which was my
initial motivation for trying this. The range query uses a ConstantScoreScorer,
and it seems to be so fast that I’ve never seen any samples of it when taking a
flamegraph.
My next hypothesis was that, by having only one MUST clause, we were enabling
use of ImpactsDISI, which was skipping lots of non-competitive doc IDs:
TermScorer only uses ImpactsDISI when it’s the top-level scoring clause, and
BooleanScorerSupplier.setTopLevelScoringClause() only propagates the call to
its clauses when (must_clauses + should_clauses) == 1.
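To make the skipping idea concrete, here is a toy sketch of block-level impact
skipping. It is a deliberate simplification, not Lucene’s actual ImpactsDISI
(which tracks per-level max scores via advanceShallow()/getMaxScore()), and the
block layout and scores are made up:

```java
import java.util.*;

// Toy sketch of impact-based skipping (not Lucene's ImpactsDISI): postings are
// split into blocks, each carrying the maximum score ("impact") any doc in the
// block can produce. When top-k collection raises the minimum competitive
// score, whole blocks below it can be skipped without per-doc advance() calls.
public class ImpactsSketch {
    record Block(int[] docs, float maxScore) {}

    // Return the docs we actually have to visit given a min competitive score.
    static List<Integer> competitiveDocs(List<Block> blocks, float minCompetitive) {
        List<Integer> visited = new ArrayList<>();
        for (Block b : blocks) {
            if (b.maxScore() < minCompetitive) continue; // skip the whole block
            for (int d : b.docs()) visited.add(d);
        }
        return visited;
    }

    public static void main(String[] args) {
        List<Block> postings = List.of(
            new Block(new int[]{1, 4, 9}, 0.5f),
            new Block(new int[]{12, 15, 20}, 2.0f),
            new Block(new int[]{31, 40, 55}, 0.8f));
        // Once the top-k heap is full and minCompetitive rises to 1.0, only
        // the middle block can still produce a competitive hit.
        System.out.println(competitiveDocs(postings, 1.0f)); // prints [12, 15, 20]
    }
}
```

That block-level skipping is the behavior I believe I’m seeing in the
rewritten query.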
Supporting this hypothesis, ImpactsDISI only appeared in the flamegraphs for
the rewritten query. When I profiled the two queries with OpenSearch’s profile
API, both the term query and the range query called advance() far fewer times
in the rewritten query than in the original (23M → 1500 calls for the range
query, 21M → 12k calls for the term query), and the number of calls to score()
showed a similar drop. Finally, I made a mock version of the dataset, ran the
queries locally under IntelliJ’s debugger, and could see ImpactsDISI in the
rewritten query skipping hundreds or thousands of doc IDs at a time. So I
figured the rewritten query was so much faster because it skips so many
documents.
To test this, I built OpenSearch against a modified test version of Lucene in
which BooleanScorerSupplier propagated setTopLevelScoringClause() to its
clauses in both cases. I expected to see both the original and rewritten
queries skipping doc IDs and taking ~40 ms, but there was no speedup for the
MUST query. Flamegraphs showed it was indeed now using ImpactsDISI, as
expected, yet when I profiled the queries the number of advance() calls was
still 21M, not 12k. So it seems we aren’t skipping doc IDs in this case.
(Unfortunately, I can’t debug with the modified Lucene to confirm this.) From
my understanding of ImpactsDISI, it should work the same either way. Do you
know why it might behave differently when the range query is a MUST clause?
Alternatively, am I completely on the wrong track? Maybe something else besides
ImpactsDISI is allowing us to skip large numbers of calls to advance() in the
rewritten query, so that enabling it for the original query didn’t actually do
anything.
Sorry for the long description - I really appreciate the help! :)
Thanks,
Peter