Hello, I’ve been working on optimizing boolean queries in OpenSearch, using Lucene 10.1. For these tests I’ve been using our http_logs benchmark, which has a date field called “@timestamp”, an integer field called “status”, and a text field called “request”. I noticed there’s sometimes a large speedup when we move a numeric MUST clause to a FILTER clause, and I was wondering if I could get some help on figuring out why. For example, this original query takes ~1500 ms:
"bool": { "must": [ {"match": {"request": "images"}}, {"range": {"@timestamp": { "gte": "1998-06-10T00:00:00Z", "lt": "1998-06-13T00:00:00Z" }}} ] } But when we rewrite the query so the range query is a FILTER clause, it speeds up to ~40 ms: "bool": { "must": {"match": {"request": "images"}}, "filter": { "range": {"@timestamp": { "gte": "1998-06-10T00:00:00Z", "lt": "1998-06-13T00:00:00Z" }} } } This speedup seems to happen for all queries where the leading (cheapest) iterator in the conjunction is the numeric one, which ultimately wraps a BitSetIterator. If the text query is leading, we see no speedup, or sometimes a slowdown. The speedup is not because we save time scoring the range query, which was my initial motivation for trying this. The range query uses a ConstantScoreScorer, and it seems to be so fast that I’ve never seen any samples of it when taking a flamegraph. My next assumption was that, by having only one MUST clause, we were enabling use of ImpactsDISI, which was skipping lots of non-competitive doc IDs. TermScorer only uses ImpactsDISI when it’s the top-level scoring clause. BooleanScorerSupplier.setTopLevelScoringClause() only propagates the call of setTopLevelScoringClause() to its clauses when (must_clauses + should_clauses) = 1. Supporting this evidence, I saw ImpactsDISI only appeared in the flamegraphs for the rewritten query. When I profiled the two queries using OpenSearch’s profile API, I saw in the rewritten query, both the term query and range query called advance() way less than in the original (23M → 1500 calls for the range query, 21M → 12k calls for the term query), and this was similar for the number of calls to score() . Finally I made a mock version of the dataset and locally ran queries while using IntelliJ’s debugger, and I was able to see ImpactDISI in the rewritten query skipping 100s or 1000s of doc IDs at a time. So, I figured the rewritten query was so much faster because we skip so many documents. 
To test this, I built OpenSearch with a modified test version of Lucene in which BooleanScorerSupplier propagates setTopLevelScoringClause() to its clauses in both cases. I assumed I'd see both the original and rewritten queries skipping doc IDs and taking ~40 ms, but there was no speedup for the MUST query. From flamegraphs, I saw it was indeed now using ImpactsDISI as I expected, but when I profiled the queries, the number of advance() calls was still 21M, not 12k. So it seems we aren't skipping doc IDs in this case. (Unfortunately I can't debug with the modified Lucene to confirm this.) From my understanding of ImpactsDISI, it should work the same either way.

Do you know why it might behave differently when the term query is one of two MUST clauses? Alternatively, am I completely on the wrong track? Maybe something else besides ImpactsDISI is allowing us to skip large numbers of calls to advance() in the rewritten query, so that enabling it for the original query didn't actually do anything.

Sorry for the long description - I really appreciate the help! :)

Thanks,
Peter